This is appellant's reply to the Examiner's Answer dated May 3, 2005.



Adding to the record after appeal 



37 CFR section 41.33(d)(2) says that after the date of filing an appeal, all other affidavits 
or other evidence "will not be admitted" except under particular narrow circumstances. 
The date of filing the appeal in this case was August 26, 2004. The record for this Board 
ought to be the record as it stood on August 26, 2004. 

On September 21, 2004, a date which is after the date of filing the appeal in this case, the 
Examiner mailed to the applicant a five-page document styled as an "advisory action." 
Nothing was attached to the five-page document when it was received by the applicant on 
September 23, 2004. 

Now comes the Examiner's Answer which says (Answer, page 6) that extrinsic evidence 
was attached to that five-page document. The applicant has now (in June of 2005) 
consulted USPTO's Image File Wrapper and has learned that seventeen extra pages were 
included in the materials scanned by USPTO into Image File Wrapper for the September 
21, 2004 document. Now in June of 2005, the applicant has for the first time seen the 
seventeen pages to which the Examiner refers. 

In any event, even if the Examiner had mailed the seventeen pages to the applicant back 
on September 21, 2004, this would have been after appeal and thus ought not to be
admitted into the record on appeal. See 37 CFR section 41.33(d)(2). 

It is thus formally requested that the September 21, 2004 document with its "other 
evidence" be stricken from the record. The five pages that were actually mailed to the 
applicant should be stricken from the record, and the seventeen pages of "extrinsic 
evidence" that were never mailed to the applicant but which were inserted into Image File 
Wrapper should likewise be stricken from the record. 

All portions of the Examiner's Answer that rely upon the September 21, 2004 "evidence" 
should likewise be stricken, and it is formally requested that such portions be stricken 
now. This includes pages 3-7 of the Answer in which the Examiner relies upon that 
"evidence" to construct the Examiner's proposed definition of "message passing," as well 
as pages 7-35 which apply the Examiner's proposed definition of "message passing" 
(which relies upon the September 21, 2004 "evidence") to the Parrish reference and to the 
rejected claims. 

Once the improper September 21, 2004 "evidence" is stricken, and once the portions of 
the Examiner's Answer which rely upon that evidence are stricken, there is little left of 
the Examiner's Answer and the appeal should be decided in the applicant's favor. 

The consequences if the Board excuses the Examiner from 
having to comply with 37 CFR section 41.33(d)(2) 

The applicant perceives some possibility that the Board may decline to strike the 
"evidence" which the Examiner attempted to add to the record after the date of the filing 
of the appeal, or stated differently perceives some possibility that the Board may choose 
to excuse the Examiner from having to comply with 37 CFR section 41.33(d)(2). Out of 
an abundance of caution, then, in the event that the Board chooses to excuse the 
Examiner from compliance with 37 CFR section 41.33(d)(2), the applicant requests that 
the applicant likewise be excused from compliance with 37 CFR section 41.33(d)(2).
Stated differently, if the Examiner is to be permitted to add to the record after the appeal 
with the Examiner's proffered extrinsic evidence as to the meaning of "message passing," 
then equity requires that applicant's proffered evidence as to the meaning of "message- 
passing communications network" likewise be admitted. 

The additional evidence is as follows: 

• Attached as Exhibit A is page 640 from Computer Architecture - A Quantitative 
Approach by Patterson and Hennessy (Morgan Kaufmann Publishers, Inc., San 
Francisco 1990, Second Edition 1996). 

• Attached as Exhibit B is an article by Gordon Bell from Communications of the ACM, 
August 1992, Vol. 35, No. 8. 

• Attached as Exhibit C is an affidavit of the inventor, Anton Gunzinger, attesting that 
Exhibits A and B are true and correct copies of what they purport to be (affidavit, 
paragraphs 6 and 7 respectively). 

Argument 

The application relates generally to parallel computing, and it is important that the 
interpretation of terms in the application be performed by one skilled in the art of parallel 
computing. In parallel computing, a way of sub-dividing computer systems is by the way 
the entities making up a parallel computer system communicate. In this respect, there are 
two fundamentally different and disjoint categories, namely the "shared memory" 
category and the "message passing" category. 

Put simply, the cited reference Parrish falls into the "shared memory" category, and the 
rejected claims fall into the "message passing" category. Parrish is non-analogous art, 
and is unavailable as a reference against the present claims for that reason. The rejection 
should be reversed for that reason. 

The "special, secret, definition" of "message-passing communications network"

The Examiner's Answer, at pages 3-7, puts forth the view that the term "message-passing 
communications network," as used in the specification and in the rejected claims, is 
unclear and undefined. The Answer says that the applicant is using a "special, secret, 
definition" of this term (Answer, page 3, section 9). 

The Answer then presents its own proposed definition of this term, a proposed definition 
that was never put forth in any Office action heretofore. To arrive at this proposed 
definition, the Answer deconstructs the term into four sub-parts (Answer, page 6) and 
then cobbles together bits and pieces from definitions of each of the four sub-parts from 
four different dictionaries (Answer, pages 6-7). In the pages which follow (Answer, 
pages 7-42) the Examiner applies this proposed definition of "message-passing
communications network" to the claims and to the cited reference Parrish in an effort to
show that each and every claim is not merely anticipated but supposedly "clearly
anticipated" by Parrish.

By the time this appeal is heard, more than eight years will have passed since this 
applicant first applied for a patent on the applicant's invention. During those eight years, 
there have been six office actions on the merits: 

• June 7, 2000 

• March 13, 2001 

• July 3, 2002 

• February 13, 2003

• March 28, 2003 

• March 24, 2004 

Never in any of these six Office Actions did the Examiner put forth the proposed 
definition of "message-passing communications network" that has now been put forth for
the first time in this Examiner's Answer. Never in any of these six Office Actions did the 
Examiner accuse the applicant of using a "special, secret, definition" of this term. 

In the Final Rejection dated March 24, 2004 (which is the Final Rejection now being 
appealed), the Examiner raised no issue as to the supposed meaninglessness of the term
"message passing communications network" as used in the claims. In that Final Office
Action, the Examiner said at paragraph 7, "Parrish et al. at col. 4 lines 3-8 indicates that
the 'bus' of his invention is a message passing communications network. ... the text of 
Parrish et al. indicates that the bus is a message passing communications network ... and 
describes that same message passing communications network. ... the bus of Parrish et al. 
is indeed a message passing communications network such as that claimed in the present 
application." 

Likewise in the February 13, 2003 Office Action, the Examiner felt that "message passing 
communications network" was a term with a clear meaning permitting the Examiner to 
arrive at a view as to whether the reference Brantley Jr. et al. had such a network. (The 
Examiner's view was in the affirmative.) The Examiner expressed the view that Brantley 
Jr. et al. "taught ... the invention as claimed ... the communications managers of the at 
least first and second processor elements communicatively coupled by means of a 
message-passing communications network (fig. 1, 10) ..." (Office Action paragraph 4.) 
The Examiner further expressed the view that two additional claim limitations employing 
the term "message-passing communications network" could supposedly be found in 
Brantley Jr. et al. 

In the March 28, 2003 Office Action at paragraph 4, the Examiner expresses the view that 
the "message-passing communications network" could somehow be found in the Parrish 
reference, at "fig. 2, 160, fig. 8b-8d, 460". 

In the March 24, 2004 Office Action, the Examiner felt that "message passing 
communications network" was a term with a clear meaning permitting the Examiner to
arrive at a view as to whether the reference Parrish had such a network. (The Examiner's 
view was in the affirmative.) 

For the past two years, the Examiner has steadfastly used the term "message-passing 
communications network" as a term with a clear meaning, a term which could be applied 
to both the cited reference Brantley Jr. et al. and to the cited reference Parrish et al., with 
the Examiner opining as to whether the "message-passing communications network" 
could be found in whichever cited reference was being discussed. 

Now, after two years during which the term "message-passing communications network" 
had a clear meaning according to the Examiner, the Examiner reverses himself and says it 
had no meaning other than applicant's "special, secret, definition" or the new proposed 
cobbled-together definition presented for the first time in the Examiner's Answer. 

What "message-passing communications network" means 

One skilled in the relevant art would know what "message-passing communications 
network" means. One of the widely known standard textbooks in the field of computer 
architecture, which includes a chapter on parallel computing, has been Computer 
Architecture - A Quantitative Approach by Patterson and Hennessy (Morgan Kaufmann 
Publishers, Inc., San Francisco 1990, Second Edition 1996) (affidavit, paragraph 6). A 
true and correct copy of page 640 of this textbook is attached as Exhibit A (id.). This 
page distinguishes between "shared memory" parallel processing systems, and "message- 
passing" parallel processing systems. See for example the underlined sentence in this 
Exhibit. 

With each of these organizations for the address space, there is an associated 
communication mechanism. For a machine with a shared address space, that 
address space can be used to communicate data implicitly via load and store 
operations; hence the name shared memory for such machines. For a machine 
with multiple address spaces, communication of data is done by explicitly passing
messages among the processors. Therefore, these machines are often called
message passing machines.

(emphasis added.) 

The article by Gordon Bell (Exhibit B) likewise reviews the taxonomy (naming 
terminology) used to distinguish between shared-memory systems and message-passing 
systems, particularly at page 37. This article is dated 1992. 

The inventor's own affidavit (attached as Exhibit C) repeats the distinctions drawn in 
Exhibits A and B between "shared-memory" and "message-passing" in parallel 
processing. As the inventor states under oath: 

In parallel computing, a plurality of linked entities (processors or computers) are 
communicatively coupled. One way of sub-dividing computer systems is by the 
way the entities making up a parallel computer system communicate. In this 
respect, there are two fundamentally different categories: the "shared memory" 
category and the "message passing" category. These two categories distinguish the 
communications methods by the way the "address space" is administered and in 
this way differentiate between ways information present within one entity is 
transmitted to an other entity. The terms "shared memory" and "message passing" 
have been used to distinguish the named two categories at least since the early 
1990's and have been known to the skilled person in the field of parallel 
computing. 

Thus, at the time the first application in this family was filed (1997) the term "message 
passing" had a clear and unambiguous meaning, the Examiner's post-appeal views to the
contrary notwithstanding. The clear and unambiguous meaning of "message passing" is distinct from the
meaning of "shared memory" in this context. 

As mentioned above, although the Examiner's post-appeal view is that "message passing" 
is unclear, and although the Examiner's post-appeal view is that only the applicant's 
"special, secret, definition" would permit the applicant to prevail, the Examiner's conduct 
in the preceding two years belied this post-appeal view. As mentioned above, in Office
Actions dated February 13, 2003, March 28, 2003, and March 24, 2004, the Examiner 
raised no question as to the definiteness of the term and indeed had no difficulty applying 
this apparently well-defined term to cited references, purporting to find this limitation in 
those references. It is only after appeal that the Examiner shifts to a view that the term is 
indefinite and requires extrinsic evidence to understand. 

In sum, not only does the affiant (the inventor) state that the term had a clear and definite 
meaning in 1997, but the Examiner himself expressed that view in 2003 and 2004. 

As a further indication that "message passing" is not a "special, secret" term but is a very 
well known term, this Board is invited to take judicial notice that when the term 
"message passing" (in quotation marks) is entered into Google, there are over three- 
quarters of a million hits. The first ten hits all relate to "message passing" in the context 
used here, namely in a parallel processing system. So do the second ten hits and the third 
ten hits. 

The art rejection 

The rejection now presented on appeal is based on a patent by Parrish et al. (US
5,117,350) and a patent application (also by Parrish) incorporated therein by reference. 
The Examiner attempts to treat a "shared memory" system as being the same as a 
"message-passing" system, and in so doing, does violence to the long-settled fact that 
these two types of system are not the same. (See Exhibits A, B, and C and discussion 
above relating thereto.) 

Parrish is discussed in detail in the Applicant's Appeal Brief and that detailed discussion 
will not be unnecessarily repeated here. Briefly, the Parrish system concerns processors 
on a common bus. Buses are not message-passing networks; rather, they operate
directly, in a read and write (or 'load' and 'store') manner. (Exhibit
A.) This is why the Parrish system has a Distributed Memory Architecture, as becomes
clear from the title of the Parrish reference and, for example from the fact that the system 
comprises a system address space (see abstract thereof). The "summary of the invention" 
section (especially col. 4, lines 51-54) also makes that aspect of Parrish clear ("... 
distributed system architecture..."). 

According to the claimed invention, however, processors are coupled by a message- 
passing communications network. 

Distributed Memory Architectures and Message Passing Architectures are distinct in the 
way the address space is set up. A distributed memory architecture is an example of a 
shared-memory architecture, where a single address space is shared by the constituents. In 
a message-passing architecture, in contrast, there is no such common address space. 
(Exhibits A, B at page 37, and C at paragraph 5.)

The Parrish system further comprises a common bus for all constituents, whereas the
present claims define that the processor elements are coupled by means of a
message-passing network. Further, in some of the presently rejected independent claims
(33 and 34), local memories of at least the first and second processor elements are 
explicitly defined not to be on a common bus (see Applicant's Appeal Brief at 14). 

By teaching a common bus, the two Parrish references teach even further away from the
claimed invention. 

In the Examiner's Answer, the Examiner claims that "message passing" is neither defined 
in the claims nor in the specification and is therefore to be attributed the broadest possible 
meaning. It is correct that the term "message passing" is not defined in the specification 
or claims. However, as becomes clear from the specification (page 1), it need not be 
defined, since it is a term commonly used in the field. Well-known terms with a well- 
defined meaning need not be defined within the body of a patent application. By way of 
comparison, the terms "processor" and "memory" and "program" are not defined in the
application as filed, yet somehow one skilled in the art is able to divine their meaning. 
Such terms need not be defined because one skilled in the art knows them. 

The Examiner had plenty of opportunities to raise the issue of alleged ill-definedness of 
the term "message passing" as concerning a communications network. 

If the Examiner had, during prosecution, raised that issue, it would have been easy for the 
applicant to submit appropriate standard textbook passages explaining the meaning and 
showing that "message passing" is a well-known term in the art. This supposed issue 
could have been raised in the June 7, 2000 Office Action, or the March 13, 2001 Office 
Action, or the July 3, 2002 Office Action, or the February 13, 2003 Office Action, or the 
March 28, 2003 Office Action, or the March 24, 2004 Office Action. Likewise even if 
one were not himself or herself skilled in the relevant art, one could easily check the 
meaning oneself, for example by means of the freely accessible online-encyclopedia 
Wikipedia (en.wikipedia.org) which provides a short definition, from which it is clear 
that "message passing" is a style of parallel programming which is an alternative to 
shared memory. 

Stating the points above in a different way, a shared memory parallel programming
system such as the one disclosed in the Parrish references cannot be at the same time a
message passing system, since "message passing" is defined as an alternative to "shared 
memory." 

It bears noting that the applicant in the "background of the invention" section, on page 1 
of the specification, clearly explains that the term "message passing" is a term generally 
used for classifying parallel computer systems and that the classification used is by 
Gordon Bell himself. 

Not on a common bus



Referring to the feature "not on a common bus", the Examiner states that in Figure 2 of 
the Parrish reference, one can see that the VME Bus and the VSB Bus are not common to 
the local memories. From this the Examiner concludes that the memories are not on a 
common bus. This is simply not so. The fact that there are other buses in the Parrish
architecture that are not common is not pertinent. The Interconnect bus 160 is a common bus. Therefore, the
local memories are on a common bus, even though there may be further buses that are not 
common. 



Conclusion

Nearly the entirety of the Examiner's Answer relies upon "evidence" which the Examiner
purports to have added to the record after the date of filing of the notice of appeal, and
which ought now to be stricken.

The Examiner argues that the applicant relies on a "secret and hidden definition" for
message passing communications network. This is not true. The definition of this term is
now, and was at the time of the filing of this application, (a) well-known to one skilled in
the art, and (b) documented in standard textbooks and the Internet.

The rejection of the claims over Parrish should be reversed with a direction that the
application pass to issue.

Respectfully submitted,

Carl Oppedahl

PTO Reg. No. 32,746

Oppedahl & Larson LLP

P O Box 5068

Dillon, CO 80435-5068
Chapter 8 Multiprocessors



Models for Communication and Memory Architecture

As discussed earlier, any large-scale multiprocessor must use multiple memories
that are physically distributed with the processors. There are two alternative
architectural approaches that differ in the method used for communicating data
among processors. The physically separate memories can be addressed as one
logically shared address space, meaning that a memory reference can be made by
any processor to any memory location, assuming it has the correct access rights.
These machines are called distributed shared-memory (DSM) or scalable
shared-memory architectures. The term shared memory refers to the fact that the
address space is shared; that is, the same physical address on two processors refers
to the same location in memory. Shared memory does not mean that there is a single,
centralized memory. In contrast to the centralized memory machines, also known
as UMAs (uniform memory access), the DSM machines are also called NUMAs,
non-uniform memory access, since the access time depends on the location of a
data word in memory.

Alternatively, the address space can consist of multiple private address spaces
that are logically disjoint and cannot be addressed by a remote processor. In such
machines, the same physical address on two different processors refers to two
different locations in two different memories. Each processor-memory module is
essentially a separate computer; therefore these machines have been called
multicomputers. As pointed out in the concluding remarks of the previous chapter,
these machines can even be completely separate computers connected on a
local area network. For applications that require little or no communication and
can make use of separate memories, such clusters of machines, whether in a closet
or on desktops, can form a very cost-effective approach.

With each of these organizations for the address space, there is an associated
communication mechanism. For a machine with a shared address space, that
address space can be used to communicate data implicitly via load and store
operations; hence the name shared memory for such machines. For a machine with
multiple address spaces, communication of data is done by explicitly passing
messages among the processors. Therefore, these machines are often called
message passing machines.
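
The two communication mechanisms described above can be illustrated with a minimal sketch. It is offered only as an illustration and is not taken from the textbook or from any reference of record. The function names `shared_memory_demo` and `message_passing_demo` are hypothetical, and Python threads with queues are assumed merely as a convenient model of processors and of an interconnect; real multicomputers have physically separate address spaces.

```python
import queue
import threading


def shared_memory_demo():
    """Model of a shared address space: data moves implicitly via store and load."""
    shared = [0]                       # one location that both workers can address
    ready = threading.Event()
    result = []

    def producer():
        shared[0] = 42                 # plain store into the shared location
        ready.set()

    def consumer():
        ready.wait()
        result.append(shared[0])       # plain load from the same location

    workers = [threading.Thread(target=f) for f in (producer, consumer)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return result[0]


def message_passing_demo():
    """Model of private address spaces: data moves only in explicit messages."""
    requests = queue.Queue()           # stands in for the network, one direction
    replies = queue.Queue()            # stands in for the network, other direction
    result = []

    def owner():
        private = {"x": 42}            # private memory, never addressed remotely
        key = requests.get()           # receive a request message
        replies.put(private[key])      # answer with a reply message

    def requester():
        requests.put("x")              # explicitly send a request
        result.append(replies.get())   # wait for the reply before continuing

    workers = [threading.Thread(target=f) for f in (owner, requester)]
    for w in workers:
        w.start()
    for w in workers:
        w.join()
    return result[0]
```

In the first function the consumer obtains the value by reading a location that the producer wrote; in the second, the requester never addresses the owner's memory and obtains the value only because a message was sent and a reply returned.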



[Message passing] communication occurs by sending messages that request action
or deliver data [...] discussed in section 7.2. For example, if one processor wants
to access or operate on data in a remote memory, it can send a message to request
the data or to perform some operation on the data. In such cases, the message can
be thought of as a remote procedure call (RPC). When the destination processor
receives the message, either by polling for it or via an interrupt, it performs the
operation or access on behalf of the remote processor and returns the result with a
reply message. This type of message passing is also called synchronous, since the
initiating processor sends a request and waits until the reply is returned before
ULTRACOMPUTERS

A Teraflop Before its Time




The quest for the
Teraflops Super-
computer to operate at
a peak speed of 10^12
floating-point opera-
tions per sec is almost a decade
old, and only one three-year 
computer generation from being 
fulfilled. The acceleration of its 
development would require an 
ultracomputer. First-generation
ultracomputers are networked 
computers using switches that 
interconnect thousands of com- 
puters to form a multicomputer, 
and cost $50 to $300 million in 
1992. These scalable computers 
are also classified as massively 
parallel, since they can be con- 
figured to have more than 1,000 
processing elements in 1992. 
Unfortunately, such computers
are specialized since only highly
parallel, coarse-grained applica- 
tions, requiring algorithm and 
program development, can 
exploit them. Government pur- 
chase of such computers would 
be foolish, since waiting three 
years will allow computers with 
a peak speed of a teraflop to be 
purchased at supercomputer 
prices ($30 million), due to 
advancements in semiconductors 
and the intense competition 
resulting in "commodity super- 
computing." More important, 
substantially better computers 
will be available in 1995 in the 
supercomputer price range if the 
funding that would be wasted in 
buying such computers is instead 
spent on training and software to 
exploit their power. 
In 1989 I described the situation
in high-performance computers in 
science and engineering, including 
several parallel architectures that 
could deliver teraflop power by 
1995, but with no price constraint 
[2]. I predicted either of two alter- 
natives: SIMDs with thousands of 
processing elements or multicom-
puters with 1,000+ interconnected,
independent computers, could 
achieve this goal. A shared-memory 
multiprocessor looked infeasible 
then. Traditional, multiple vector 
processor supercomputers such as 
Crays would simply not evolve to a 
teraflop until 2000. Here is what 
happened. 

1. During 1992, NEC's four- 
processor SX3 is the fastest com- 
puter, delivering 90% of its peak 
22 Gflops for the Linpack bench-
mark, and Cray's 16-processor
YMP C90 provides the greatest 
throughput for supercomputing 
workloads. 

2. The SIMD hardware approach 
that enabled Thinking Machines to 
start up in 1983 and obtain DARPA 
funding was abandoned because it 
was only suitable for a few, very 
large-scale problems, barely mul- 
tiprogrammed, and uneconomical 
for workloads. It is unclear whether 
large SIMDs are "generation"- 
scalable, and they are clearly not 
"size"-scalable. The main result of 
the CM2 computer was 10 Gflop-
level of performance for large-scale
problems. 

3. Ultracomputer-sized, scalable 
multicomputers (smC) were intro- 
duced by Intel and Thinking Ma- 
chines, using "Killer" CMOS, 32-bit 
microprocessors. These product 
introductions join multicomputers 
from companies such as Alliant, 
AT&T, IBM, Intel, Meiko, Mer- 
cury, NCUBE, Parsytec, and 
Transtech. At least Convex, Cray, 
Fujitsu, IBM, and NEC are working 
on new-generation smCs that use 
64-bit processors. By 1995, this
score of efforts, together with the 
evolution of fast, LAN-connected 
workstations will create "commod-
ity supercomputing." The author
advocates workstation clusters
formed by interconnecting high-
speed workstations via new high-
speed, low-overhead switches, in
lieu of special-purpose multicom- 
puters. 

4. Kendall Square Research intro- 
duced their KSR 1 scalable, shared- 
memory multiprocessors (smP) 
with 1,088 64-bit microprocessors. 
It provides a sequentially consistent 
memory and programming model, 
proving that smPs are feasible. The 
KSR breakthrough that permits 
scalability to allow it to become an
ultracomputer is based on a distrib-
uted memory scheme,
ALLCACHE™, that eliminates
physical memory addressing. The 
ALLCACHE design is a confluence 
of cache and virtual memory con- 
cepts that exploit locality required 
by scalable, distributed computing. 
Work is not bound to a particular 
memory, but moves dynamically to 
the processors requiring the data. A 
multiprocessor provides the great- 
est and most flexible ability for 
workload since any processor can 
be deployed on either scalar or par- 
allel (e.g., vector) applications, and 
is general-purpose, being equally 
useful for scientific and commercial 
processing, including transaction 
processing, databases, real time, 
and command and control. The 
KSR machine is most likely the 
blueprint for future scalable, mas- 
sively parallel computers. 

Figure 1 shows the evolution of
supers (four- to five-year gestation) 
and micro-based scalable comput- 
ers (three-year gestation). In 1992, 
petaflop (10^15 flops) ultracom-
puters, costing a half-billion dollars
do not look feasible by 2001. Den- 
ning and Tichy [7] argue that sig- 
nificant scientific problems exist to 
be solved, but a new approach may 
be needed to build such a machine. 
I concur, based on results to date,
technology evolution, and lack of 
user training. 

The teraflop quest is fueled by 
the massive (gigabuck-level) High
Performance Computing and Com-
munications Program (HPCC,
1992) budget and DARPA's mili-
tary-like, tactical focus on teraflops
and massive parallelism with 
greater than 1,000 processing ele- 
ments. The teraflops boundary is 
no different than advances that cre- 
ated electronic calculators 
(kiloflops), Cray computers 
(megaflops), and last-generation 
vector supercomputers (Gflops). 
Vector processing required new 
algorithms and new programs, and 
massively parallel systems will also 
require new algorithms and pro- 
grams. With slogans such as "indus- 
trial competitiveness," the teraflop 
goal is fundable — even though 
competitiveness and teraflops are 
difficult to link. Thus, HPCC is a 
bureaucrat's dream. Gigabuck pro- 
grams that accelerate evolution are 
certain to trade off efficacy, bal- 
anced computing, programmabil- 
ity, users, and the long term. Al- 
ready, government-sponsored 
architectures and selected purchas- 
ing have eliminated benchmarking 
and utility (e.g., lacking mass stor- 
age) concerns as DARPA focus nar- 
rowed on the teraflop. Central pur- 
chase of an ultracomputer for a 
vocal minority wastes resources, 
since no economy of scale exists, 
and potential users are not likely to 
find or justify problems that effec- 
tively utilize such a machine with- 
out a few years of use on smaller 
machines. 

Worlton describes the potential 
risk of massive parallelism in terms 
of the "bandwagon effect," where 
we make the biggest mistakes in 
managing technology [23]. The ar-
ticle defines "bandwagon" as "a
propaganda device by which the 
purported acceptance of an idea, 
product or the like by a large num- 
ber of people is claimed in order to 
win further public acceptance." He 
describes a massively parallel band- 
wagon drawn by vendors, com-
puter science researchers, and bu-
reaucrats who gain power by
increased funding. Innovators and 
early adopters are the riders. The 
bandwagon's four flat tires are 
caused by the lack of systems soft- 
ware, skilled programmers, guide-
posts (heuristics about design and
use), and parallelizable applica- 
tions. 

The irony of the teraflops quest 
is that programming may not 
change very much even though vir- 
tually all programs must be rewrit- 
ten to exploit the very high degree 
of parallelism required for efficient 
operation of the coarse-grained, 
scalable computers. Scientists and 
engineers will use just another dia- 
lect of Fortran that supports data 
parallelism. 

All computers, including true 
supers, use basically the same, 
evolving, programming model for 
exploiting parallelism: SPMD, a 
single program, multiple data 
spread across a single address space 
that supports Fortran [16]. In fact, 
a strong movement is directed to- 
ward the standardization of High 
Performance Fortran (HPF) using 
parallel data structures to simplify 
programming. With SPMD, the 
same program is made available to 
each processor in the system. 
Shared memory multiprocessors 
simply share a copy in common 
memory and each computer of a 
multicomputer is given a copy of 
the program. Processors are synchronized at the end of parallel
work units (e.g., outermost DO loop). Multicomputers, however,
have several sources of software overhead because communication
is by message-passing instead of direct memory reference. With
SPMD and microprocessors with 64-bit addressing, multicomputers
will evolve to be the multiprocessors they simulate by 1995. Thus,
the mainline, general-purpose computer is almost certain to be the
shared memory multiprocessor after 1995.

The article will first describe 
supercomputing evolution and the
importance of size-, generation-, 
and problem-scalability to break the 
evolutionary performance and 
price barriers. A discussion about 
measuring progress will follow. A 
taxonomy of alternatives will be 
given to explain the motivation for 
the multiprocessor continuing to be
the mainline, followed by specific
industrial options that illustrate real 
trade-offs. The final sections de- 
scribe computer design research 
activities and the roles of computer 
and computational science, and 
government. 

Evolution to the Ultracomputer: A Scalable Supercomputer

Figure 1. Performance (Gflops) of Cray and NEC supercomputers, and
Cray, Intel, and Thinking Machines scalable computers vs.
introduction date

Machine scalability allows the $30 million price barrier to be
broken for a single computer, so that several hundred million
dollars' or a teraflop's worth of networked computers, the
ultracomputer, can be assembled. Until 1992, a supercomputer was
defined both as the
most powerful central computer 
for a range of numerically intense 
computation (i.e., scalar and vector 
processing), with very large data 
sets, and costing about $30 million. 
The notion of machine or "size"
scalability, permitting computers of 
arbitrary and almost unlimited size, 
together with finding large-scale 
problems that run effectively have 
been key to the teraflop race [10]. 
This is a corollary of the Turing 
test: People selling computers must 
be smarter than their computers. 
No matter how difficult a computer 
is to use or how poorly it performs 
on real workloads, given enough 
time, someone may find a problem 
for which the computer performs 
well. The problem owner extols the 
machine to perpetuate government 
funding. 

COMMUNICATIONS OF THE ACM / August 1992 / Vol. 35, No. 8

In 1988, the Cray YMP/8 delivered a peak of 2.8 Gflops. By 1991,
the Intel Touchstone Delta (a 672-node multicomputer) and the
Thinking Machines CM2 (a 2K processing element SIMD) both began
to supply an order of magnitude more peak power (20 gigaflops)
than supercomputers. In super- 
computing, peak or advertising 
power is the maximum perfor- 
mance that the manufacturer guar- 
antees no program will exceed. 
Benchmark kernels such as matrix 
operations run at near peak speed 
on the Cray YMP. Multicomputers require O(25,000) matrices to
operate effectively (e.g., 14 Gflops from a peak of 20), and
adding processors does not help. For the O(1,000) matrices that
are typical of supercomputer applications, smCs with several
thousand processing elements deliver negligible performance.

Supers 1992 to 1995 
By mid-1992 a completely new generation of computers has been
introduced. Understanding a new generation and redesigning it to
be less flawed takes at least three years. Understanding this
generation should make it possible to build the next-generation
supercomputer-class machine that would reach a teraflop of peak
power for a few large-scale applications by the end of 1995.



Table 1 shows six alternatives for high-performance computing:
two traditional supers, one smP, and three "commodity supers" or
smCs, including 1,000 workstations. Three metrics characterize a
computer's performance and workload abilities. Linpeak is the
operation rate for solving a system of linear equations and is
the best case for a highly parallel application. Solving systems
of linear equations is at the root of many scientific and
engineering applications. Large, well-programmed applications
typically run at one-fourth to one-half this rate. Linpack
1K x 1K is typical of problems solved on supercomputers in 1992.
The Livermore Fortran Kernels (LFK) harmonic mean for 24 loops
and 3 sizes is used to characterize a numerical computer's
ability, and is the worst-case rating for a computer as it
represents an untuned workload.

New-generation, traditional or 
"true" multiple vector processor 
supercomputers have been deliv- 
ered by Cray and NEC that provide 
one-fourth to one-eighth the peak 
power of the smCs to be delivered 
in 1992. "True" supercomputers 
use the Cray design formula: ECL 
circuits and dense packaging tech- 
nology to reduce size, allow the 
fastest clock; one or more pipelined 
vector units with each processor 
provide peak processing for a For- 
tran program; and multiple vector 
processors communicate via a 
switch to a common, shared mem- 
ory to handle large workloads and 
parallel processing. Because of the 
dense physical packaging of high- 
power chips and relatively low den- 
sity of the 100,000 gate ECL chips, 
the inherent cost per operation for 
a supercomputer is roughly 500 to 
1,000 peak flops/$ or 4 to 10 times 
greater than simply packaged, 2 
million transistor "killer" CMOS 
microprocessors that go into lead- 
ing edge workstations (5,000 peak 
flops/$). True supercomputers are 
not in the teraflops race, even 
though they are certain to provide 
most of the supercomputing capac- 
ity until 1995. 

Intel has continued the pure multicomputer path by introducing
its third generation, Paragon, with up to 4K Intel i860
microprocessor-based nodes, each with a peak of 4 x 75 Mflops,
for delivery in early 1993. Intel is offering a 6K node,
$300 million, special ultracomputer for delivery in late 1993
that would provide 1.8 peak teraflops or 6K peak flops/$.

In October 1991, Thinking Machines Corp. (TMC) announced its
first-generation multicomputer, the CM5, consisting of Sun
servers controlling up to 16K Sparc microprocessor-based
computational nodes, each with four connected vector processors,
to be delivered in 1992. The CM5 workload ability of a few Sun
servers is small compared to a true supercomputer, and to begin
to balance the computer for general utility would require several
disks at each node. The expected performance for both
supercomputer-sized problems and a workload means that the
machine is fundamentally special-purpose for highly parallel
jobs. The CM5 provides 4,300 peak flops/$.

In both the Paragon and CM5 it
is likely that the most cost-effective 
use will be with small clusters of a 
few (e.g., 32) processors. 

1995 Supers: Vectors, Scalable Multicomputers, or Multiprocessors

Traditional or "true" supercomput- 
ers have a significant advantage in 
being able to deliver the computa- 
tional power during this decade 
because they have evolved for four four-year generations for
almost 20 years,¹ and have an installed software base,
programming paradigm, trained programmers, and
wider applicability inherent in finer 
granularity. The KSR-1 scalable 
multiprocessor runs traditional, 
fine-grained supercomputer For- 
tran programs, and has extraordi- 
nary single-processor scalar and 
commercial (e.g., transaction pro- 
cessing) throughput. 

The smCs are unlikely to be al- 
ternatives for general-purpose 
computing or supercomputing be- 
cause they do not deliver significant 
power for scalar- and finer-grained 
applications that characterize a 
supercomputer workload. For ex- 
ample, the entire set of accounts 
using the Intel smC at Caltech is
less than 200, or roughly the num- 
ber of users that simultaneously use 
a large super. Burton Smith [20] defines a general-purpose
computer as: 1. Reasonably fast execution of any algorithm that
performs well on another machine. Any kind of parallelism should
be exploitable. 2. Providing a machine-independent programming
environment. Software should be no harder to transport than to
any other computer. 3. A storage hierarchy performance consistent
with computational capability. The computer should not be I/O
bound to any greater extent than another computer.

¹Cray 1 (1975), Cray 1S (1978), Cray XMP-2, 4 (1982, 1984),
Cray YMP-8 (1988), and Cray C-90 (1992)

Whether traditional supercomputers or massively parallel
computers provide more computing, measured in flops/month, by
1995 is the object of a bet between the author and Danny Hillis
of Thinking Machines [11]. Scalable multicomputers (smCs) are
applicable to coarse-grained, highly parallel codes, and someone
must invent new algorithms and write new programs. Universities
are rewarded with grants, papers, and the production of
knowledge. Hence, they are a key to utilizing coarse-grained,
parallel computers. With pressure to aid industry, the Department
of Energy laboratories see massive parallelism as a way to
maintain staffs. On the other hand, organizations concerned with
cost-effectiveness simply cannot afford the effort unless they
obtain uniquely competitive capabilities.

Already, the shared, virtual memory has been invented to aid in
the programming of multicomputers. These machines are certain to
evolve to multiprocessors with the next generation. Therefore,
the mainline of computing will continue to be an evolution of the
shared memory multiprocessor just as it has been since the
mid-1960s [4]. In 1995, users should be able to buy a scalable
parallel multiprocessor for 25K peak flops/$, and a teraflop
computer would sell for about $40 million.
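
As a rough check on the price figures quoted in this article, the
cost of a peak teraflop follows directly from a machine's peak
flops per dollar. The sketch below is illustrative only: the
function name `teraflop_price` and the labels are assumptions
introduced here, and the rates are the ones quoted in the text.

```python
TERAFLOP = 1e12  # peak floating-point operations per second


def teraflop_price(peak_flops_per_dollar: float) -> float:
    """Dollars needed to buy one peak teraflop at a given peak flops/$."""
    return TERAFLOP / peak_flops_per_dollar


if __name__ == "__main__":
    # Rates quoted in the article; labels are for display only.
    quoted_rates = [
        ("1995 scalable multiprocessor", 25_000),
        ("Intel ultracomputer offer", 6_000),
        ("killer-micro workstation", 5_000),
        ("CM5", 4_300),
    ]
    for label, rate in quoted_rates:
        millions = teraflop_price(rate) / 1e6
        print(f"{label}: about ${millions:,.0f} million per peak teraflop")
```

At 25K peak flops/$ the function gives the $40 million quoted
above; peak figures say nothing about delivered performance.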



Measuring Progress 

Supercomputer users and buyers 
need to be especially cautious when 
evaluating performance claims for 
supercomputers. Bailey's [1] twelve 
ways to obfuscate are: 
1. Quote 32-bit performance results as 64-bit performance

2. Present inner kernel perfor- 
mance as application performance, 
neglect I/O 

3. Employ assembly, micro-code 
and low-level code 

4. Scale up problem size to avoid 
performance drop-off when using 
large numbers of processors with 
overhead or inter-communication 
delays or limits 

5. Quote performance results ex- 
trapolated to a full system based on 
one processor 

6. Compare results against un- 
optimized, scalar Cray code 

7. Compare direct run-time with 
old code on an obsolete system 
(e.g., Cray 1) 

8. Quote additional operations 
that are required when using a par- 
allel, often obsolete, algorithm 

9. Quote processor utilization, 
speedup, or Mflops/$ and ignore
performance 

10. Mutilate the algorithm to 
match the architecture, and give 
meaningless results 

11. Compare results using a loaded 
vs. dedicated system 

12. If all else fails, show pictures 
and videos 

Each year a prize administered by the ACM and IEEE Supercomputing
Committees is awarded to reward practical progress in
parallelism, encourage improvements, and demonstrate the utility
of parallel processors [9]. Various prize
categories recognize program 
speedup through parallelism, abso- 
lute performance, performance/ 
price, and parallel compiler ad- 
vances. The first four years of 
prizes are given in Table 2. 

The 1987 prize for parallelism was won by a team at Sandia
National Laboratory using a 1K node NCUBE and solving three
problems. The team extrapolated that with more memory, the
problem could be scaled up to reduce overhead, and a factor of
1,000 (vs. 600) speedup could be achieved. An NCAR Atmospheric
Model running on the Cray XMP had the highest performance. In
1989 and 1990, a CM2 (SIMD with 2K processing elements) operated
at the highest speed, and the computation was done with 32-bit
floating point numbers. The problems solved were 4 to 16 times
larger than would ordinarily have been solved, with modified
problems requiring additional operations [1].

Benchmarking 

The benchmarking process has 



Scalability

The perception that a computer can grow forever has always been
a design goal (e.g., IBM System/360, c. 1964, provided a 100:1
range, and VAX existed at a range of 1,000:1 over its lifetime).
Ideally, one would start with a single computer and buy more
components as needed to provide size scalability. Similarly, when
new processor technology increased performance, one would add
new-generation computers in a generation-scalable fashion.
Ordinary workstations provide some size and generation
scalability, but are LAN-limited. By providing suitable
high-speed switching, workstation clusters can supply parallel
computing power and are an alternative to scalable
multicomputers. Problem scalability is the ability of a problem,
algorithm, or program to exist at a range of sizes so it can be
used efficiently and correctly on a given, scalable computer.

Worlton [23] discusses Amdahl's law and the need for a very large
fraction, F, of a given program to be parallel, when using a
large number of processors, N, to obtain high efficiency, E(F,N):

$$E(F,N) = \frac{1}{F + N \times (1 - F)}$$

Thus, scaling up slow processors is a losing proposition for a
given fraction of parallelism.



been a key to understanding com- 
puter performance until the 
teraflop started and peak perfor- 
mance replaced reality as a selec- 
tion criterion. Computer perfor- 
mance for an installation can be 
estimated by looking at various 
benchmarks of similar programs, 
and collections of benchmarks that 
represent a workload [5]. Bench- 
marks can be synthetic (e.g., Dhry- 
stones and Whetstones), kernels 
that represent real code (e.g., Liv- 
ermore Loops, National Aerody- 



An efficiency of 50% requires 1 - F = 1/(N - 1); for 1,000
processors, F must be 0.999.
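
The efficiency relation can be checked numerically. The following
is a minimal sketch, not code from the article; the function names
are assumptions chosen for illustration, and the formula is
Amdahl's speedup divided by the number of processors.

```python
def efficiency(parallel_fraction: float, processors: int) -> float:
    """E(F, N) = 1 / (F + N * (1 - F)): Amdahl speedup divided by N."""
    f, n = parallel_fraction, processors
    return 1.0 / (f + n * (1.0 - f))


def fraction_for_half_efficiency(processors: int) -> float:
    """Parallel fraction F that yields 50% efficiency: 1 - F = 1 / (N - 1)."""
    return 1.0 - 1.0 / (processors - 1)


if __name__ == "__main__":
    n = 1000
    f = fraction_for_half_efficiency(n)
    print(f"N = {n}: F = {f:.6f} gives E = {efficiency(f, n):.3f}")
    # Even a 99% parallel program uses 1,000 processors poorly.
    print(f"N = {n}: F = 0.99 gives E = {efficiency(0.99, n):.3f}")
```

The second line of output illustrates the sidebar's point: a
small serial fraction dominates once N is large.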

Size Scalability: 
Locality is the Key 

Size scalability has become an academic topic [12, 20]. Size
scalability simply means that a very large computer such as the
ultracomputer can be built [...] computer is practical
(affordable) in a reasonable time scale. For example, a
cross-point switch is supposedly not scalable because the
switching area grows $O(n^2)$ even though switch cost may be
insignificant. Much supercomputer cost is the processor-memory
switch, and scaling is accomplished by having different
switch/cabinets for different size computers. For example, Cray
Research, Intel, and Thinking Machines all build active switches
into their cabinets into which computing elements are plugged.
The CM5 and KSR computers require switching cabinets when going
beyond the first levels of their hierarchies. KSR interconnects
processors using a near zero cost ring, since each node just
connects to its next neighbor. No computers are truly scalable
in a linear fashion.

A size-scalable computer is
designed from a small number 
namic Simulation), numerical libraries (e.g., Linpack for matrix
solvers, FFT), or full applications (e.g., SPEC for workstations,
Illinois's Perfect Club, Los Alamos Benchmarks). No matter what
measure is used to understand a computer, the only way to
understand how a computer will perform is to benchmark the
computer with the applications to be used. This also measures the
mean time before answers (mtba), a most important measure of a
computer's productivity.



Several Livermore Loop metrics, 
using a range of three vector 
lengths for the 24 loops, are useful: 
the arithmetic mean typifying the 
best applications (.97 vector ops), 
the arithmetic mean for optimized 
applications (.89 vector ops), geo- 
metric mean for tuned workload 
(.74 vector ops), harmonic mean for 
untuned workload (.45 vector ops), 
and harmonic mean compiled as a 
scalar for an all-scalar operation (no 
vector ops). Three Linpack measurements are important: Linpack
100 x 100, Linpack 1,000 x 1,000 
for typical supercomputing appli- 
cations, and Linpeak (for an uncon- 
strained sized matrix). Linpeak is 
the only benchmark that is run ef- 
fectively on a large multicomputer. 
Massive multicomputers can rarely 
run an existing supercomputer 
program (i.e., the dusty deck pro- 
grams) without a new algorithm or 
new program. In fact, the best 
benchmark for any computer is 
whether a manufacturer can and is 



of basic components, with no single bottleneck component, so the
computer can be incrementally expanded over its designed scaling
range, delivering linear incremental performance for a
well-defined set of applications. The components include
computers, processors or processing elements, memories, switches,
and cabinets. For example, since the highly parallel computers
are interconnected by switches, the bandwidth of the switch
should increase linearly with processing power. It is clear that
a balanced, general-purpose teraflop computer is not feasible
based on I/O considerations. For example, if I/O requirements
increase with performance as in a general-purpose computer, then
roughly 0.1 terabytes/sec or 20,000 5MB/sec disks operating in
parallel would be needed to balance the computer (about one bit
of data transferred for every flop). Emitting video from a
computer for direct visualization is one way to effectively
utilize the I/O bandwidth and reduce mass storage.
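
The disk count in the I/O balance argument is simple arithmetic.
The sketch below is a hedged illustration under the stated rule of
thumb of about one bit of I/O per flop; the function names and
default values are assumptions made for this example.

```python
def balanced_io_bytes_per_sec(peak_flops: float, bits_per_flop: float = 1.0) -> float:
    """I/O bandwidth needed if every flop moves bits_per_flop bits of data."""
    return peak_flops * bits_per_flop / 8.0


def disks_needed(io_bytes_per_sec: float, disk_bytes_per_sec: float = 5e6) -> float:
    """Number of disks, each sustaining disk_bytes_per_sec, run in parallel."""
    return io_bytes_per_sec / disk_bytes_per_sec


if __name__ == "__main__":
    exact = balanced_io_bytes_per_sec(1e12)  # exactly one bit per flop
    print(f"one bit per flop at 1 teraflop: {exact / 1e12:.3f} TB/sec")
    print(f"disks at exactly one bit per flop: {disks_needed(exact):,.0f}")
    # The article rounds the bandwidth to roughly 0.1 TB/sec.
    print(f"disks at 0.1 TB/sec: {disks_needed(0.1e12):,.0f}")
```

Exactly one bit per flop gives 0.125 TB/sec; rounding to 0.1
TB/sec reproduces the 20,000 disks cited in the text.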

The key to size scalability is a belief in spatial and temporal
locality, since very large systems have inherently longer
latencies than small, central systems. All supercomputers are
predicated to some degree on locality (i.e., once a datum is
accessed, a near physical neighbor will be accessed [spatial]
and the same datum will be repeatedly accessed [temporal]).
Locality of program execution is the phenomenon that allowed the
first one-level store computer, Atlas, to be built. This led to
the understanding of paging, virtual memory, and working sets
that are predicated on locality [6]. Caches exploit spatial and
temporal locality automatically. Large register arrays, including
vector registers, are mechanisms for a compiler to exploit and
control locality.

In 1989, building computers that scaled economically over a very
wide range of implementations looked impractical because of the
enormous interconnection bandwidth requirements. The Cray YMP 8
and CM2 scaled over a range of eight (eight processors in the
Cray, and 8K to 64K processing elements in the CM2). The Cray
C-90 scaling range is 16, and other implementations of the Cray
architecture increase the performance range a factor of 5. Also,
four C90s can be interconnected providing 64 peak Gflops. The
CM5 scaling range is 32 for 1,024 computers and KSR 1 has a range
of 128 (8 to 1,088) processor-memory pairs. The CM5 scaling range
extends to 512 with 16K computers as a $480 million
ultracomputer. In practical terms, a scalable computer is one
that can exist at the largest size an organization (including a
country) will ever buy.



Generation (Time) Scalability

Generation (time) scalability is as important as size
scalability, since the basic microprocessor nodes become obsolete
every three years; furthermore, the time to find an algorithm and
write a program is long, requiring significant investment that
needs to be preserved. A generation-scalable computer can be
implemented in a new technology, and thus take advantage of
improved circuit and packaging technologies. Since CMOS evolves
rapidly, the interconnection bandwidth must grow at the same rate
as processing speed and memory. For example, it is irrelevant to
have a design that can exploit "next-generation" microprocessor
nodes without increasing the switch bandwidth and decreasing the
overhead and latency proportionally. All characteristics of a
computer must scale proportionally: processing speed, memory
speed and sizes, interconnect bandwidth and latency, I/O, and
software overhead, in order to be useful for a given application.

Problem Scalability: Key to 
Performance 

Problem scalability defines whether an application is feasible on
a computer with given granularity characteristics. In
practical terms, problem 
scalability means that a pro- 
gram can be made large 
enough to operate efficiently 
willing to benchmark a user's programs.
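
The Livermore Loop summaries mentioned earlier differ only in
which mean is taken over the per-loop rates. The sketch below
shows why the harmonic mean is the most pessimistic of the three.
The rates are hypothetical values chosen for illustration, not
measured Livermore results, and `summarize` is an assumed helper
name.

```python
from statistics import geometric_mean, harmonic_mean, mean

# Hypothetical per-loop rates in Mflops (NOT measured data): a few
# loops vectorize well, others run at scalar speed.
loop_rates = [5.0, 20.0, 80.0, 160.0]


def summarize(rates):
    """Return (arithmetic, geometric, harmonic) means of the given rates."""
    return mean(rates), geometric_mean(rates), harmonic_mean(rates)


if __name__ == "__main__":
    arithmetic, geometric, harmonic = summarize(loop_rates)
    print(f"arithmetic mean: {arithmetic:.1f} Mflops")
    print(f"geometric mean:  {geometric:.1f} Mflops")
    print(f"harmonic mean:   {harmonic:.1f} Mflops")
```

The harmonic mean weights each loop by the time it takes, so the
slow scalar loops dominate, which is why it is used to
characterize an untuned workload.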

Two factors make benchmarking parallel systems difficult: problem
scalability (or size) and the number of processors or job
streams. The maximum output is the perfectly parallel workload
case in which every processor is allowed to run a given-size
program independently and the uniprocessor work-rate is
multiplied by the number of processors. Similarly, the minimum
wall clock time should be when all processors are used in
parallel. Thus, performance is a surface of varying problem size
(scale) and the number of processors in a cluster.

The Commercial Alternatives

The quest generated by the HPCC Program and the challenge of
parallelism have attracted almost every computer company, lashing
together microprocessors (e.g., Inmos Transputers, Intel i860s,
and Sparcs) to provide commodity, multicomputer supercomputing in
every price range from PCs to ultracomputers. In addition,
traditional supercomputer evolution will continue well into the
twenty-first century. The main fuel for growth is the continued
evolution of "killer" CMOS microprocessors and the resulting
workstations.



on a computer with a given granularity. Problems such as Monte
Carlo simulation and "ray tracing" are "perfectly parallel" since
their threads of computation almost never come together.
Obtaining parallelism (i.e., performance) has turned out to be
possible with new algorithms and new codes. Problem granularity
(operations on a grid point/data required from adjacent grid
points) must be greater than a machine's granularity (node
operation rate/node-to-node communication data-rate) in order for
a computer to be effective. Several kinds of messages must pass
among distributed computer nodes: a priori messages that a
compiler can generate to ensure that data is available to a node
before it is needed; computed address data, requiring messages
for both address and data; and various broadcast and
synchronization information. For example, message-passing and
random access reference overheads are sufficiently large to
render today's multicomputers ineffective for classical
benchmarks such as the Livermore kernels. Denning and Tichy [7]
discuss the effects of problem scalability and granularity on
performance.

In the case of models of physical structures, as the problem size
is scaled up by increasing the grid points, the work or potential
parallelism and memory increases at least $O(n^3)$, where n is
the problem dimension. The communication with other cells only
grows as $O(n^2)$. Thus, communication overhead can often be
reduced or ignored compared to the computation if a problem can
be made large enough (i.e., get enough grid points) to still fit
in primary memory of a distributed node. For example, in solving
Laplace's equation, computation is $7n^3$ (the time to average
the neighboring points) and, on a distributed memory computer,
the communication is $6n^2$, where n is the problem dimension.
Thus, a 100Mflop computer intercommunicating at 1 Megaword per
sec is balanced when the computation time of $0.07n^3$ microsec
equals the communication time of $6n^2$ microsec. That is, n must
be larger than 86, which means holding a 3D array of 640K points,
or just 5MB. About 1 microsec, however, is required to send or
receive a word on multicomputers, representing an opportunity
cost of 2 x 100 operations. For a problem that would fill a 32MB
memory, n is about 160. This size problem requires 0.3 sec of
computation and 2 x 0.15 sec of send and receive time in which
the processor is idle. The iteration time is 0.6 seconds,
resulting in a computation rate of 50Mflops with only 0.15 sec of
communication link time, and smaller problems would run more
slowly. Since the memory is full, the problem cannot be further
scaled to increase efficiency.
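
The Laplace example can be reproduced with a few lines of
arithmetic. This is a sketch under the sidebar's own assumptions
(7 operations per grid point, 6 faces of n x n words exchanged, a
100Mflop node, a 1 Megaword per sec link, and 1 microsec of
processor time to send or receive each word); the function names
are introduced here for illustration.

```python
def compute_time_us(n: int, node_mflops: float = 100.0) -> float:
    """Time for one sweep: 7 operations on each of n^3 grid points."""
    return 7.0 * n**3 / node_mflops  # Mflops equals operations per microsec


def words_exchanged(n: int) -> float:
    """Words exchanged with neighbors per sweep: six faces of n^2 points."""
    return 6.0 * n**2


def balance_dimension(node_mflops: float = 100.0, link_mwords_per_sec: float = 1.0) -> float:
    """n at which compute time equals link time: 7 n^3 / R = 6 n^2 / B."""
    return 6.0 * node_mflops / (7.0 * link_mwords_per_sec)


def delivered_mflops(n: int, node_mflops: float = 100.0, us_per_word: float = 1.0) -> float:
    """Rate when the processor idles while sending and receiving every word."""
    operations = 7.0 * n**3
    idle_us = 2.0 * us_per_word * words_exchanged(n)  # send plus receive
    return operations / (compute_time_us(n, node_mflops) + idle_us)


if __name__ == "__main__":
    print(f"balance point: n = {balance_dimension():.1f}")
    print(f"n = 160: compute {compute_time_us(160) / 1e6:.2f} sec, "
          f"delivered {delivered_mflops(160):.1f} Mflops")
```

With these assumptions the balance point is just under 86, and a
160-point problem delivers a little under half the node's peak,
in line with the rounded figures in the text.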

For some problems, scaling a problem may produce no better
results than a coarser grain, and less costly solution. Another
risk of problem-scaling is to exacerbate the limited I/O
capability. For example, the 25K x 25K matrix that Intel's Delta
used for a 13 Gflop matrix multiply takes 5GB of memory and
500 sec to load using 10MB/sec disks, and the [...] time is
>2,000 sec. In contrast, a C90 achieves peak power on a 4K x 4K
matrix [...] to compute and 128MB to store. Contrast this
structure with computers connected via Ethernet, which has a
total bandwidth of 1 million words per sec for all nodes, and
message-passing overheads of at least 1,000 microsec (100,000
operations on a 100Mflop computer). However, given a large enough
problem and enough memory per node, even such a collection of
workstations can be scaled to have a long enough grain to be
effective for solving the preceding problem.
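
The storage side of the problem-scaling risk can be estimated
directly. The sketch below assumes 8-byte (64-bit) matrix elements
and an aggregate disk rate of 10MB/sec; both are assumptions made
for this illustration, as are the function names.

```python
BYTES_PER_WORD = 8  # assumes 64-bit floating-point elements


def matrix_bytes(n: int) -> int:
    """Storage for a dense n x n matrix of 64-bit words."""
    return n * n * BYTES_PER_WORD


def load_seconds(num_bytes: float, disk_bytes_per_sec: float = 10e6) -> float:
    """Time to stream num_bytes from disk at the given aggregate rate."""
    return num_bytes / disk_bytes_per_sec


if __name__ == "__main__":
    big = matrix_bytes(25_000)   # the size used on the Delta
    small = matrix_bytes(4_096)  # a 4K x 4K matrix
    print(f"25K x 25K: {big / 1e9:.1f} GB, {load_seconds(big):.0f} sec to load")
    print(f"4K x 4K:   {small / 2**20:.0f} MB")
```

Scaling the matrix dimension by about six multiplies the storage
and load time by nearly forty, which is the I/O penalty the
sidebar warns about.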

Figure 2 shows the structure of a basic unit of multithreaded
computation independent of whether it is run on a SIMD,
multiprocessor, or multicomputer, or has distributed or shared
components. Hockney and Jesshope [14] formulated models that
predict performance as a function of a computer's characteristics
and
Traditional or "True" Supercomputer Manufacturers' Response

In 1989 the author estimated that several traditional supers
would be announced this year by U.S. companies. Cray Research and
NEC have announced products, and Fujitsu has announced its intent
to enter the U.S. supercomputer market. Seymour Cray formed Cray
Computer. Supercomputer Systems Inc. lacks a product, and
DARPA's Tera Computer Company is in the design phase. Germany's
Suprenum project was stopped. The French Advanced Computer
Research Institute (ACRI) was started. Numerous Japanese
multicomputer projects are underway. Supercomputers are on a
purely evolutionary path driven by increasing clock speed,
pipelines, and processors. Clock speed increases do not exceed a
factor of two every five years (about 14% per year). In 1984, a
committee projected the Cray 3 would operate in 1986. In 1989,
the 128-processor Cray 4 was projected to operate at 1 GHz in
1992. In 1992, the 16-processor Cray 3, projected to operate at
500 MHz, was stopped. NEC has pioneered exceptional vector
speeds, retaining the title of the world's fastest vector
processor. The NEC vector processor uses 16 parallel pipelines
and performs
problem parallelism. The computation starts with a sequential
thread (1), followed by supervisory scheduling (2), where the
processors begin threads of computation (3), followed by
inter-computer messages that update variables among the nodes
when the computer has a distributed memory (4), and finally
synchronize prior to beginning the next unit of parallel work
(5). The communication overhead period (4) inherent in
distributed memory structures is usually distributed throughout
the computation and possibly completely overlapped.
Message-passing overhead (send and receive calls) in
multicomputers can be reduced by specialized hardware operating
in parallel with computation. Communication bandwidth limits
granularity, since a certain amount of data has to be transferred
with other nodes in order to complete a computational grain.
Message-passing calls and synchronization (5) are nonproductive.

The Paragon should be able to operate on relatively
smaller-grained problems than a CM5, since hardware granularity
(node operation rate per internode communication rate) appears to
be lower. The CM5 requires a problem-grain length of at least 102
operations per word transferred (128 Mflops / (10/8) Mwords per
sec); this means that for every word transmitted to another node,
at least 102 operations have to be carried out in order to avoid
waiting for data. Paragon is projected to be 3 operations per
word. To reduce this effective grain size, the problem just has
to be scaled up to "bundle" a large number of computational grid
points within a physical node to reduce internode communication.
Very large-grained problems typically have several thousand
operations per grain in order to reduce all overheads.
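
Hardware granularity as defined above is a single ratio. The
sketch below is an illustration only; the function names are
assumptions, and the CM5 figures are the ones quoted in the
sidebar (128 Mflops per node and 10/8 Mwords per sec per link).

```python
def hardware_granularity(node_mflops: float, link_mwords_per_sec: float) -> float:
    """Operations a node can perform in the time it takes to move one word."""
    return node_mflops / link_mwords_per_sec


def is_effective(problem_granularity: float, machine_granularity: float) -> bool:
    """A problem suits a machine only if its grain exceeds the machine's."""
    return problem_granularity > machine_granularity


if __name__ == "__main__":
    cm5 = hardware_granularity(128.0, 10.0 / 8.0)
    print(f"CM5: about {cm5:.0f} operations per word transferred")
    # A hypothetical problem performing 50 operations per word exchanged
    # would leave a CM5 node waiting for data.
    print("50 ops/word effective on CM5:", is_effective(50.0, cm5))
```

A problem whose grain falls below the machine's ratio must be
scaled up, bundling more grid points per node, before it can run
without waiting on communication.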

Figure 2. Structure of multithreaded, parallel computation

Additionally, the message-passing time is lost time that also
limits system performance. Assuming the operating system is not
involved in passing a message, David Culler at Berkeley has
measured 3.3 microsec for a CM5 to send and receive a message
plus 1.0 microsec per word sent or received. During this lost
time of 2 microsec, 256 operations could have been carried out.
Paragon attempts to reduce the message-passing overhead and
increase the bandwidth by having a separate processor and block
transfer hardware manage message transfers.
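
The lost-time figure converts directly into forgone operations.
The sketch below is a simple linear cost model built from the
measurements quoted above (3.3 microsec per message plus 1.0
microsec per word, on a 128 Mflops node); the model and the
function names are assumptions for illustration, not Culler's
methodology.

```python
def message_time_us(words: int, startup_us: float = 3.3, per_word_us: float = 1.0) -> float:
    """Linear cost model: fixed per-message cost plus a cost for each word."""
    return startup_us + per_word_us * words


def lost_operations(time_us: float, node_mflops: float = 128.0) -> float:
    """Operations the node could have executed during time_us of lost time."""
    return time_us * node_mflops  # Mflops equals operations per microsec


if __name__ == "__main__":
    print(f"2 microsec of lost time: {lost_operations(2.0):.0f} operations")
    for words in (1, 16, 256):
        t = message_time_us(words)
        print(f"{words:4d}-word message: {t:7.1f} microsec, "
              f"{lost_operations(t):9.0f} operations forgone")
```

Under this model the per-word cost dominates for long messages,
which is why offloading transfers to separate hardware, as the
Paragon attempts, matters.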

In a distributed memory multiprocessor no messages are explicitly
passed; however, hardware processes messages and caches data.
Multiprocessors avoid several sources of software overhead
inherent in 1992 multicomputers: converting the addresses of each
32-bit computer into a single, >32-bit global address; deciding
in which computer to locate data, including the possibility of
relocating and renaming data dynamically (i.e., controlling
locality by simulating caches); passing messages containing
variables that other nodes need just in time for another computer
to use; and passing computed addresses and data when random
access of memory is required.
operations at a 6.4Gflop rate. Fujitsu's supercomputer provides
two 2.5Gflop vector units shared by four scalar processors. Every
four or five years the number of processors is doubled, providing
a gain of 18% per year. The C90 increased the clock frequency by
50% to 250MHz, and doubled the number of pipelines and the number
of processors over the YMP. Figure 1 projects a 1995 Cray
supercomputer that will operate at 100Gflops using a double-speed
clock, twice the number of processors, and twice the number of
pipelines.

In most production environments, throughput is the measure and
the C-90 clearly wins, since it has a factor of four times the
number of processors of either Fujitsu or NEC. In an environment
where a small number of production programs are run, additional
processors with scalar capability may be of little use.
Environments that run only a few coarse-grained codes can
potentially use smCs for large-grained problems.

The traditional supercomputer 
market does not look toward high 
growth because it provides neither 
the most cost-effective solution for 
simpler scalar programs, nor the 
peak power for massively parallel 



applications. Scalar codes run most 
cost-effectively on workstations, 
while very parallel code may be run 
on massively parallel computers, 
provided the granularity is high 
and the cost of writing the new code 
is low. Despite these factors, I be- 
lieve traditional supercomputers 
will be introduced in 2000. 

"Killer" CMOS Micros for Building 
Scalable Computers

Progress toward the affordable 
teraflop using "killer" CMOS mi- 
cros is determined by advances in 
microprocessor speeds. The projec- 
tion [2] that microprocessors would 
improve at a 60%-per-year rate, 
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Tern Taxonomy 

The taxonomy of the
tera candidates, shown
in Figure 3, includes
only shared-memory multiproc-
essors and various multicom-
puters (MIMDs). A superscalar
or extra wide word RISC, vec- [...]

with thousands of processing
elements are just [...]

Distributed (Boudoir) vs. 
Centralized (Dance Hall) 
Computing: Locality Beliefs 

Two attributes structure the
taxonomy: multiprocessors vs.
multicomputers; and scalability
using a physically distributed
vs. a central memory. Scalability
measures whether it is practi-
cal to construct ultracom-
puters. A memory is either cen-
tralized in a pool, "dance hall"
(processors-switch-
memories); or distributed with
each processor enabling
scalability, "boudoir" ([pro-
cessor-memory]-switch), and
is the key to scalability.
Switches bottleneck overall
performance and limit system
size in every computer; thus
the switch is the determinant
of a computer's scalability.
Switches such as the CM5 has
allow a very large network of
computers to be put together
just as an arbitrary number of
workstations or telephones can
be interconnected. Although
large switches permit arbitrary
peak power, the cost is prob-
lem granularity, mean time be-
fore applications, and limited
applicability.

Multiprocessors vs.
Multicomputers

The hardware distinction be-
tween multiprocessors and
multicomputers is whether the
system has and maintains a sin-
gle address space and a single
coherent memory and whether
explicit messages are required
to access memory on other
computing nodes as shown in
the programming view in Fig-
ure 4. The question is similar to
RISC vs. CISC, since multicom-



Figure 3. Taxonomy of multiprocessors and multicomputers (MIMD)

Multiprocessors:
  Dynamic binding of addresses to processors: KSR
  Static binding, ring multi: IEEE SCI standard proposal
  Static binding, caching: Alliant, DASH
  Static program binding: BBN, Cedar, Cm*
  Cross-point or multi-stage: Cray, Fujitsu, Hitachi, IBM, NEC, Tera
  Simple, ring multi, bus multi replacement
  Bus multis: DEC, Encore, NCR, ..., Sequent, SGI, Sun

Multicomputers (multiple address space, message-passing computation):
  Mesh connected: Intel
  Butterfly/fat tree: CM5
  Hypercubes: NCUBE
  Fast LANs for high-availability and high-capacity clusters: DEC, Tandem
  LANs for distributed processing: workstations, PCs
  Central multicomputers
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Figure 4. Programming views of 
shared-memory multiprocessor and 
distributed multicomputer 
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puter operating systems are 
evolving to carry out the func- 
tions (e.g., address construc- 
tion, caching, message-passing 
for data access) that multipro- 
cessor hardware provides. The 
differences are: 

1. Multiprocessors have a single
address space supported by
hardware. Each computer of a
multicomputer has its own
address space; software forms
a common, global address
space as a concatenation of a
computer's node number and
the computer node's address
to support the SPMD program
model.

2. Multiprocessors have a sin-
gle, uniformly accessible mem-
ory and are managed by and
provide a single, timeshared
operating system programming
environment (i.e., Unix). Multi-
computers are a collection of
independent, interconnected
computers under control of a
LAN-connected, distributed
workstation-like operating sys-
tem. Each computer has a copy
or kernel of the operating sys-
tem.

3. A multiprocessor has a com-
mon work queue that any pro-
cessor may access and be ap-
plied to. In a multicomputer,
work (programs and their data)
is distributed among the com-
puters, usually on a static basis.
As the load on the computers
or clusters changes, work may
have to be moved.

Any node in a multiprocessor
may run any size job from its
shared, virtual memory. In mul-
ticomputers, a job's size is lim-
ited by a node's memory size,
and a computer is incapable of
or ineffective at running a col-
lection of large, scalar pro-
grams that typify perfectly par-
allel applications such as digital
simulation.

4. Multiprocessors communi-
cate data implicitly by directly
accessing a common memory.
Multicomputers explicitly pass
messages that may or may not
be hidden from the user by
hardware and the compiler.

When a multiprocessor is
used for parallel processing,
data and programs are equally
accessible to all processors.
Programs and their data must
be allocated among computers
in order to minimize message-
passing overhead. To minimize
message-passing, data may also
have to be moved and re-
named as in a virtual memory
system.

5. Multiprocessors provide a
single, sequentially consistent
memory and program model.
Since message-passing is used
to move data in multicomput-
ers, different copies of vari-
ables may reside in various
computer nodes at one time.

6. Distributed memory
multiprocessors have an auto-
matic mechanism, caching, to
implicitly control locality. As a
datum is accessed it is auto-
matically moved to another
processor's memory. With mul-
ticomputers, every nonlocal
access requires software for
address translation, message-
passing access, and memory
management to deal with cop-
ies of data.

7. Multiprocessors provide the
most efficient support for mes-
sage-passing applications be-
cause messages are passed by
passing pointers as in
uniprocessors. Multicomputers
require moving data.

8. A multiprocessor is inher-
ently general-purpose, since
any collection of small to large
and sequential-to-parallel pro-
grams can be operated on at
any time. A multicomputer
operates best on a very large,
parallel program which is run
to completion. If any of the
nodes lack a facility that must
be obtained in other nodes,
bottlenecks can occur when
accessing other nodes.

Because of the general-
purposeness, scalable multi-
processors can be applied to
real time, command and con-
trol, commercial transaction
processing and database man-
agement. Multicomputers are
not general-purpose in terms
of either applications or job
size mix.
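
Item 1 above describes the global address that multicomputer software forms by concatenating a node number with that node's local address. The following is a minimal sketch of that idea, not any vendor's actual scheme; the 32-bit local address width and the function names are assumptions chosen for illustration.

```python
# Sketch: a software-formed global address for a multicomputer.
# Assumption: each node has a 32-bit local address space, so the node
# number occupies the bits above bit 31. Names are hypothetical.
LOCAL_BITS = 32
LOCAL_MASK = (1 << LOCAL_BITS) - 1


def make_global_address(node, local_address):
    """Concatenate a node number and a node-local address."""
    if node < 0:
        raise ValueError("node number must be non-negative")
    if not 0 <= local_address <= LOCAL_MASK:
        raise ValueError("local address does not fit in LOCAL_BITS")
    return (node << LOCAL_BITS) | local_address


def split_global_address(global_address):
    """Recover (node, local_address) from a global address."""
    return global_address >> LOCAL_BITS, global_address & LOCAL_MASK


if __name__ == "__main__":
    g = make_global_address(node=5, local_address=0x1000)
    print(hex(g), split_global_address(g))
```

Every nonlocal reference in a multicomputer pays for this translation in software, which is the overhead a hardware-supported single address space avoids.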

Latency inherent with 
Performance: There's No Free 
Lunch 

Each computer represents a
trade-off to deal with the in-
creased latency inherent in
building a large computer re-
quiring [...] In the
case of multiprocessors, data is
in a shared memory that is de-
layed by switching (dance hall),
or in another distributed pro-
cessor-memory pair (boudoir).
Similarly, in a multicomputer,
explicit messages must be sent
to another computer. The al-
ternatives represent trade-offs
among such issues as how to
and where to deal with latency,
the degree of locality, and the
degree of problem granularity.
These architectural alternatives
represent different beliefs
about application locality:

1. SIMDs: Put processing ele-
ments with memory, do opera-
tions fast, allocate data to mini-
mize communication with
other nodes, send data when
required and wait when parts
of the computation need to
share data (CM1 ... CM2).
Thinking Machines abandoned
SIMD since only one very-
large-scale parallel job could be
run at a given time, making it
cost-ineffective and non-
general-purpose. Also, SIMDs
have negligible scalar perfor-
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mance, making them useless 
for anything but massively par- 
allel, coarse-granularity applica- 
tions. 

2. Multivector processor super-
computers: Use vector proces-
sors to move more data (a vec-
tor) in one instruction, overlap
instructions, and join opera-
tions of several instructions
together in a single pipeline
(chaining). The vector registers
hide the latency that comes
with bandwidth. Employ pro-
gram-controlled buffer
memories to further cache
data (Crays, Fujitsu, Hitachi,
NEC).

3. Distributed multiprocessors:
Caching, pre-fetching, and
post-storing are used to hide
latency. Programs and data
migrate to a processor-memory
node on demand. Hardware
automatically replicates data in
local nodes using caches and
maintains memory coherence
(KSR, DASH, T*, Alewife).

4. Multicomputers: Couple
processors and memories to
form cost-effective computer
nodes. Place the same program
in all nodes and allocate data
across all computers to mini-
mize moving data. When data
movement is required in
nonperfectly parallel programs,
the compiler generates mes-
sages to transfer data to other
nodes. Build mechanisms to
broadcast data and recombine
results (TMC and Intel).

5. Multistream (or mul-
tithreaded) multiprocessors:
Provide a constant, but long
latency path between physical
processors and memory. Build
multi-instruction stream pro-
cessors whereby one physical
processor acts as many sepa-
rate processors. Pre-fetch and
post-store data to cover the
long, constant latency. This
processor can be used in all
the preceding computers (Tera,
T*, Alewife).

The species 

The specific distributed multi-
computers of Figure 3 are seg-
mented by interconnection
bandwidth. LAN-connected
workstations have the lowest
bandwidth, but in the future
could provide a significant
amount of computing power
by utilizing various parallel
computing environments, such
as Linda, Parasoft's Express, or
the Parallel Virtual Machine
(PVM). Since 1975 Tandem has
been using clusters of comput-
ers for redundancy and in-
creased capacity; DEC intro-
duced VAX clusters in 1982.
Seitz (Cal Tech) pioneered the
multicomputer for supercom-
puting, and Intel has built
three generations based on
much of this work. Many com-
panies offer multicomputers
using Transputers or Intel i860
processors for practical and
pedagogical use.

Two alternative interconnect
switches are used for nonscal-
able multiprocessors. A single
bus (Figure 5) is the simplest
way to build a multiple micro-
processor or "multi" [2]. With
the evolution of microproces-
sors to support "multis," any
computer from the simplest PC
can easily become a multipro-
cessor. Multis are limited by
the capability of the bus that is
formed on printed wiring, and
hence is not capable of signifi-
cant size or generation
scalability. In the future, the
bus will be replaced by a ring,
providing the essential features
of a bus, but scaling with size
and generation (i.e., clock
speed, since a chip only drives a
neighbor), as shown in Figure
6. The bandwidth for a "ring
multi" increases as the number
of nodes increases (using multi-
ple tokens) at the expense of
increased latency (to be hidden
by a cache).

Mainframes and supers use
cross-points and multistage
networks to interconnect pro-
cessors and memories (Figure
7). Since up to three memory
accesses may be required to
execute a statement such as
A = B + C in order to compute



Figure 5. Bus "multi" (e.g., DEC, Se-
quent, SGI, Sun)

Figure 6. Ring "multi" (e.g., IEEE SCI) 

Figure 7. Multiple vector processor 
supercomputers (e.g., Cray, Fujitsu, Hi- 
tachi, NEC) 
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one flop. It is easy to see why 
the switch connecting proces- 
sors and memory limits a com- 
puter. As a switch is increased
in size and bandwidth, latency
and grain increase. Worse yet,
scalar performance decreases.

Three scalable multiproces-
sors were built as research ef-
forts beginning with CMU's
Cm*. The single address space
was used to eliminate mes-
sage-passing and to simplify
the naming and allocation of
memory. All required programs
to be located in particular
nodes and suffered from the
same flaw on which multicom-
puters are predicated. Stan-
ford's DASH binds programs
statically to nodes, but uses
caching to reduce latency
when a remote node requires
data. This reduces or eliminates
the need for perfect data-to-
node assignment. Nevertheless,
over long-term use, it is imper-
ative to move data perma-
nently to the computing node
requiring the data.

The KSR smP to be described
solves the data-to-node assign-
ment problem inherent in a
distributed memory mP by pro-
viding hardware that controls
the automatic migration of
memory to any node that may need
it. KSR's breakthrough occurred
by conceptually eliminating
physical addresses and making
the memory into a cache so
that information could be auto-
matically moved to a processor
when needed.
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providing a quadrupling of perfor- 
mance each three years still appears 
to be possible for the next few years 
(Table 3). The quadrupling has its 
basis in Moore's Law stating that 
semiconductor density would quad- 
ruple every three years. This ex- 
plains memory-chip-size-evolution. 
Memory size can grow proportion- 
ally with processor performance, 
even though the memory band- 
width is not keeping up. Since clock 
speed only improves at 25% per 
year (a doubling in three years), the 
additional speed must come from 
architectural features (e.g., su- 
perscalar or wider words, larger 
cache memories, and vector pro- 
cessing). 
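
The growth rates quoted above can be checked by simple compounding. The sketch below is only a worked illustration of that arithmetic; the function names are hypothetical, and the "architectural" factor is just the ratio of the two compounded figures, not a number stated in the text.

```python
# Worked check of the growth rates quoted in the text, by compounding.

def growth_over(years, annual_rate):
    """Total improvement factor after compounding annual_rate for years."""
    return (1.0 + annual_rate) ** years


def annual_rate_for(factor, years):
    """Annual rate that yields the given total factor over years."""
    return factor ** (1.0 / years) - 1.0


perf_3yr = growth_over(3, 0.60)    # performance at 60% per year
clock_3yr = growth_over(3, 0.25)   # clock speed at 25% per year
architecture_3yr = perf_3yr / clock_3yr  # share left for architecture

if __name__ == "__main__":
    print(f"performance over 3 years: {perf_3yr:.2f}x")
    print(f"clock over 3 years:       {clock_3yr:.2f}x")
    print(f"left for architecture:    {architecture_3yr:.2f}x")
    # Doubling processor count every four years, as for supercomputers:
    print(f"doubling in 4 years:      {annual_rate_for(2, 4):.1%} per year")
```

Under these figures, 60% per year compounds to roughly a quadrupling in three years, 25% per year to roughly a doubling, so about a factor of two every three years has to come from architectural features.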

The leading edge microproces- 
sors described at the 1992 Interna- 
tional Solid State Circuits Confer- 
ence included: a microprocessor 
based on Digital's Alpha architec- 
ture with a 150- or 200MHz clock 
rate; and the Fujitsu 108 (64-bit) / 216
(32-bit) Mflop Vector Processor
chip that works with Sparc chips. 
Using the Fujitsu chip with a micro- 
processor would provide the best 
performance for traditional super- 
computer-oriented problems. Per- 
haps the most important improve- 
ment to enhance massive 
parallelism is the 64-bit address 
enabling a computer to have a large 
global address space. With 64-bit 
addresses and substantially faster 
networks, some of the limitations of 
message-passing multicomputers
can be overcome. 

In 1995, $20,000 distributed
computing node microprocessors
with peak speeds of 400 to 800
Mflops can provide 20,000 to
40,000 flops/$. For example, such
chips are a factor of 12 to 25 times
faster than the vector processor
chips used in the CM5 and would
be 4.5 to 9 times more cost-effective.
Both ECL and GaAs are unlikely
runners in the teraflop race since
CMOS improves so constantly in
speed and density. Given the need
for large on-chip cache memories
and the additional time and cost
penalties for external caches, it is
likely that CMOS will be the semi-
conductor technology for scalable
computers.
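
The cost-effectiveness figures above follow from dividing peak speed by node price. The sketch below reproduces that arithmetic. The 32-Mflop figure for a single CM5 vector unit is an assumption, inferred from the 128-Mflop node peak and four vector units per node cited later in the article; the function name is hypothetical.

```python
# Peak flops per dollar for the projected 1995 computing nodes.

def flops_per_dollar(peak_mflops, price_dollars):
    """Peak floating-point operations per second per dollar."""
    return peak_mflops * 1e6 / price_dollars


NODE_PRICE = 20_000                # dollars, from the text
CM5_VECTOR_UNIT_MFLOPS = 128 / 4   # assumption: node peak / 4 units

if __name__ == "__main__":
    for peak in (400, 800):
        ratio = peak / CM5_VECTOR_UNIT_MFLOPS
        print(f"{peak} Mflops: {flops_per_dollar(peak, NODE_PRICE):,.0f} "
              f"flops/$, {ratio:.1f}x a CM5 vector unit")
```

With these inputs the two node speeds give 20,000 and 40,000 flops per dollar and speed ratios of 12.5 and 25, consistent with the ranges quoted.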

Not Just Another Workstation: A
Proposal for Having Both High- 
Performance Workstations and 
Massive Parallelism 

Workstations are the purest and 
simplest computer structure able to 
exploit microprocessors since they 
contain little more than a processor, 
memory, CRT, network connec- 
tion, and i/o logic. Furthermore, 
their inherent CRTs solve a signifi- 
cant part of the i/o problem. A 
given workstation or server node 
(usually just a workstation without a 
CRT, but with large memory and a 
large collection of disks) can also 
become a multicomputer. 

Nielsen of Lawrence Livermore 
National Laboratory (LLNL) has 
outlined a strategy for transitioning 
to massively parallel computing 
[18]. LLNL has made the observa-
tion that it spends about three times
as much on workstations that are 
only 15% utilized, as it does on 
supercomputers. By 1995, micro- 
processor-based workstations could 
reach a peak of 500Mflops, provid- 
ing 25,000 flops per dollar or 10 
times the projected cost-effective- 
ness of a super. This would mean 
that inherent in its spending, LLNL 
would have about 25 times more 
unused peak power in its worksta- 
tions than it has in its central super- 
computer or specialized massively 
parallel computer. 
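
One reading of the LLNL figures that reproduces the "about 25 times" estimate is sketched below: workstation spending is three times supercomputer spending, each workstation dollar buys ten times the peak flops, and 85% of that peak sits idle. This is an illustrative reconstruction under those assumptions, not a calculation given in the source, and the function name is hypothetical.

```python
# Rough reconstruction of the unused-workstation-power estimate.

def unused_peak_ratio(spend_ratio, cost_effectiveness_ratio, utilization):
    """Unused workstation peak relative to the supercomputer's peak."""
    total_peak_ratio = spend_ratio * cost_effectiveness_ratio
    return total_peak_ratio * (1.0 - utilization)


# 500 Mflops at 25,000 flops per dollar implies this workstation price.
implied_workstation_price = 500e6 / 25_000

if __name__ == "__main__":
    ratio = unused_peak_ratio(spend_ratio=3,
                              cost_effectiveness_ratio=10,
                              utilization=0.15)
    print(f"unused workstation peak: about {ratio:.1f}x the super")
    print(f"implied workstation price: ${implied_workstation_price:,.0f}")
```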

The difficult part of using work- 
stations as a scalable multicomputer 
(smC) is the low-bandwidth com- 
munication links that limit their 
applicability to long-grained prob- 
lems. Given that every workstation 
environment is likely to have far 
greater power than a central super, 
however, the result should clearly 
justify the effort. An IEEE stan- 
dard, the Scalable Coherent Inter- 
face or SCI, is being implemented 
to interconnect computers as a sin- 
gle, shared-memory multiproces- 
sor. SCI uses a ring, such as KSR, to 
interconnect the computers. A dis- 
tributed directory tracks data as 



copies migrate to the appropriate 
computer node. Companies such as 
Convex are exploring the SCI for 
interconnecting HP's micros as an 
alternative and preferred mini- 
super that can also address the 
supercomputing market. 

A cluster of workstations inter- 
connected at speeds comparable to 
Thinking Machines' CM5 would
be advantageous in terms of power, 
cost-effectiveness, and administra- 
tion compared with LAN-con- 
nected workstations and supercom- 
puters. Such a computer would 
have to be centralized in order to 
have low latency. Unlike traditional 
timeshared facilities, however, 
processors could be dedicated to 
individuals to provide guaranteed 
service. With the advent of HDTV, 
low-cost video can be distributed 
directly to the desktop, and as a 
byproduct users would have video 
conferencing. 

smPs: Scalable Multiprocessors 
The Kendall Square Research 
KSR 1. The Kendall Square Re-
search KSR 1 is a size-and-genera- 
tion-scalable, shared-memory mul- 
tiprocessor computer. It is formed 
as a hierarchy of interconnected 
"ring multis." Scalability is achieved 
by connecting 32 processors to 
form a "ring multi" operating at 
1GB/sec (128 million accesses per
sec). Interconnection bandwidth 
within a ring scales linearly, since 
every ring slot may contain a trans- 
action. Thus, a ring has roughly the 
capacity of a typical cross-point 
switch found in a supercomputer 
room that interconnects 8 to 16, 
100MB/sec HIPPI channels. The
KSR 1 uses a two-level hierarchy to 
interconnect 34 rings (1,088 proc- 
essors), and is therefore massive. 
The ring design supports an arbi- 
trary number of levels, permitting 
ultras to be built. 
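
The ring figures above are mutually consistent, as the short check below suggests. It is a sketch that takes "1GB/sec" as 10^9 bytes per second, which is an assumption; the constant names are hypothetical.

```python
# Consistency check of the KSR 1 ring figures quoted in the text.

PROCESSORS_PER_RING = 32
RINGS = 34
RING_BYTES_PER_SEC = 1e9        # assumption: 1GB/sec taken as 10**9
RING_ACCESSES_PER_SEC = 128e6
HIPPI_BYTES_PER_SEC = 100e6

total_processors = PROCESSORS_PER_RING * RINGS
bytes_per_access = RING_BYTES_PER_SEC / RING_ACCESSES_PER_SEC
hippi_equivalents = RING_BYTES_PER_SEC / HIPPI_BYTES_PER_SEC

if __name__ == "__main__":
    print(f"processors in a two-level system: {total_processors}")
    print(f"bytes moved per ring access:      {bytes_per_access:.2f}")
    print(f"equivalent HIPPI channels:        {hippi_equivalents:.0f}")
```

Each access moves roughly one 64-bit word, and one ring carries about as much as ten HIPPI channels, which falls inside the 8-to-16-channel range given for a cross-point switch.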

Each node is comprised of a pri- 
mary cache, acting as a 32MB pri- 
mary memory, and a 64-bit su- 
perscalar processor with roughly 
the same performance as an IBM 
RS6000 operating at the same 
clock rate. The superscalar proces-
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sors containing 64 floating-point 
and 32 fixed-point registers of 64 
bits is designed for both scalar and 
vector operations. For example, 16 
elements can be pre-fetched at one 
time. A processor also has a 0.5MB
sub-cache supplying 20 million ac- 
cesses per sec to the processor 
(computational efficiency of 0.5). A 
processor operates at 20MHz and
is fabricated in 1.2 micron CMOS.
The processor, sans caches, contains 
3.9 million transistors in 6 types of 
12 custom chips. Three-quarters of 
each processor consists of the 
Search Engine responsible for mi- 
grating data to and from other 
nodes, for maintaining memory 
coherence throughout the system, 
using distributed directories, and 
ring control. 

The KSR 1 is significant because 
it provides size- (including I/O) and 
generation-scalable smP in which 
every node is identical; an efficient 
environment for both arbitrary 
workloads (from transaction pro- 
cessing to timesharing and batch) 
and sequential to parallel process- 
ing through a large, hardware- 
supported address space with an 
unlimited number of processors; a 
strictly sequential consistent pro- 
gramming model; and dynamic 
management of memory through 
hardware migration and replication 
of data throughout the distributed, 
processor memory nodes, using its 
Allcache mechanism. 

With sequential consistency, 
every processor returns the latest 
value of a written value, and results 
of an execution on multiple proces- 
sors appear as some interleaving of 
operations of individual nodes 
when executed on a multithreaded 
machine. With Allcache, an address 
becomes a name and this name au- 
tomatically migrates throughout 
the system and is associated with a 
processor in a cache-like fashion as 
needed. Copies of a given cell are
made by the hardware and sent to 
other nodes to reduce access time. 
A processor can pre-fetch data into 
a local cache and post-store data for 
other cells. The hardware is de- 
signed to exploit spatial and tempo- 




ral locality. For example, in the 
SPMD programming model, copies 
of the program move dynamically 
and are cached in each of the oper-
ating nodes' primary and processor
caches. Data such as elements of a 
matrix move to the nodes as re- 
quired simply by accessing the data, 
and the processor has instructions 
that pre-fetch data to the proces- 
sor's registers. When a processor 
writes to an address, all cells are 
updated and memory coherence is 
maintained. Data movement occurs 
in sub-pages of 128 bytes (16 
words) of its 16K pages. 
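
The sub-page figures imply an 8-byte word and 128 sub-pages per page. The sketch below shows how an address could be split into page, sub-page, and byte offset under those sizes. It assumes "16K" means 16,384 bytes and uses a flat byte address purely for illustration; it does not describe the actual Allcache hardware, and the names are hypothetical.

```python
# Splitting a byte address into page, sub-page, and offset.

PAGE_BYTES = 16 * 1024     # assumption: "16K pages" = 16,384 bytes
SUBPAGE_BYTES = 128        # unit of data movement, from the text
WORDS_PER_SUBPAGE = 16     # from the text

WORD_BYTES = SUBPAGE_BYTES // WORDS_PER_SUBPAGE
SUBPAGES_PER_PAGE = PAGE_BYTES // SUBPAGE_BYTES


def locate(address):
    """Return (page, sub-page within page, byte offset in sub-page)."""
    page, in_page = divmod(address, PAGE_BYTES)
    subpage, offset = divmod(in_page, SUBPAGE_BYTES)
    return page, subpage, offset


if __name__ == "__main__":
    print(WORD_BYTES, SUBPAGES_PER_PAGE)
    print(locate(PAGE_BYTES + 130))
```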

Every known form of parallelism 
is supported via KSR's Mach-based 
operating system. Multiple users 
may run multiple sessions, compris- 
ing multiple applications, compris- 
ing multiple processes (each with 
independent address spaces), each 
of which may comprise multiple 
threads of control running simulta- 
neously sharing a common address 
space. Message-passing is sup- 
ported by pointer-passing in the 
shared memory to avoid data copy- 
ing and enhance performance. 

KSR also provides a commercial 
programming environment for 
transaction processing that accesses 
relational databases in parallel with 
unlimited scalability, as an alterna- 
tive to multicomputers formed 
from multiprocessor mainframes. 
A 1K-node system provides almost
two orders of magnitude more pro- 
cessing power, primary memory, 
I/O bandwidth, and mass storage 
capacity than a multiprocessor 
mainframe. For example, unlike 
the typical tera-candidates, a 1,088- 
node system can be configured with 
15.3 terabytes of disk memory, pro- 
viding 500 times the capacity of its 
main memory. The 32- and 320- 
node systems are projected to de- 
liver over 1,000 and 10,000 transac- 
tions per sec, respectively, giving it 
over a hundred times the through- 
put of a multiprocessor mainframe. 

smCs: Scalable Multicomputers for
"Commodity Supercomputing"
Multicomputer performance and 



applicability are determined by the 
number of nodes and concurrent 
job streams, the node and system 
performance, I/O bandwidth, and 
the communication network band- 
width, delay, and overhead time. 
Table 1 gives the computational 
and workload parameters, but for a 
multicomputer operated as a 
SPMD, the communications net- 
work is quite likely the determinant 
for application performance. 

Intel Paragon: A Homogeneous 
Multicomputer. This is shown in 
Figure 8. A given node consists of 
five i860 microprocessors: four 
carry out computation as a shared- 
memory multiprocessor operating 
at a peak of 300Mflops rate, and 
the fifth handles communication 
with the message-passing network. 
Each processor has a small cache, 
and the data-rate to primary mem- 
ory is 50 million accesses per sec, 
supporting a computational inten- 
sity of 0.67 for highly select prob- 
lems. The message-passing proces- 
sor and the fast 2D mesh topology 
provide the very high, full-duplex 
data-rate among the nodes of 
200MB/sec. The mesh provides 
primitives to support synchroniza- 
tion and broadcasting. 

Paragon is formed as a collection 
of nodes controlled by the OSF/1
(Mach) operating system with micro 
kernels that support message-pass- 
ing among the nodes. Each node 
can be dynamically configured to 
be a service processor for general- 
purpose timesharing, or part of a 
parallel-processing partition, or an 



Figure 8. Intel Paragon multicom- 
puter 
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I/O computer because it attaches to 
a particular device. A variety of 
programming models are pro- 
vided, corresponding to the evolu- 
tion toward a multiprocessor. Two 
basic forms of parallelism are sup- 
ported: SPMD using a shared vir- 
tual memory and MIMD. With 
SPMD, a single, partition-wide, 
shared virtual memory space is cre- 
ated across a number of computers, 
using a layer of software. Memory 
consistency is maintained on a page 
basis. With MIMD a program is 
optimized to provide the highest 
performance within a node using 
vector processing, for example. 
Messages are explicitly passed 
among the nodes. Each node can 
have its own virtual memory. 

CM5: A Multicomputer Designed 
to Operate as a Collection of 
SIMDs. The CM5 is shown in Fig- 
ure 9 consisting of 1 to 32 Sun 
server control computers, Cc, (for a 
1K-node system), on which user
programs run; the computational 
computers, Cv, with vector units; 
Sun-based I/O server nodes, Cio; 
and a switch to interconnect the ele- 
ments. The system is divided into a 
number of independent partitions 
with at least 32 Cv's that are man- 
aged by one Cc. A given partition- 
ing is likely to be static for a rela- 
tively long time (e.g., work shift to 
days). The Sun servers and I/O 
computers run variants of Sun O/S, 
providing a familiar user-operating 
environment together with all the 
networking, file systems, and 
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Figure 9. CM 5 multicomputer 



graphical user interfaces. Both the 
SPMD and message-passing pro- 
gramming models are supported. 
Each of the computational nodes, 
Cv, can send messages directly to 
Cio's, but other system calls must be 
processed in the Cc. 

The computational nodes are 
Sparc micros that control four in- 
dependent vector processors that 
operate independently on 8MB 
memories. A node is a message- 
passing multicomputer in which the 
Sparc moves data within the four 
8MB memories. Memory data is 
accessed by the four vector units at 
16Maccess per sec (Maps) each, 
providing memory bandwidth for a 
computational intensity of 0.5. 
Conceptually, the machine is 
treated as an evolution of the SIMD 
CM2 that had 2K floating-point 
processing elements connected by a 
message-passing hypercube. Thus,
a 1K-node CM5 has 4K processing
elements, and message-passing
among the 4 vector units and other
nodes is controlled by the Sparc
processors. The common program
resides in each of the nodes. Note
that using Fujitsu Vector Processing
chips instead of the four CM5 vec-
tor chips would increase peak per-
formance by a factor of 3.3, making 
the 1995 teraflop peak achievable 
at the expense of a well-balanced 
machine. 

Computational intensity is the 
number of memory accesses per 
flop required for an operation(s) of 
a program. Thus depending on the 
computational intensity of the op- 
erations, speed will vary greatly. 
For example, the computational 
intensity of the expression A = B + 
C is 3, since 3 accesses are required 
for every flop, giving a peak rate of 
21Mflops from a peak of 128. A 
C90 provides 1.5Maps per 1Mflops,
or a 16-processor system is capable 
of operating at 8Gflops. 
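
The relation between computational intensity and delivered speed can be sketched as a simple bound: a node cannot exceed its peak, nor the rate at which memory supplies operands. The code below applies that bound to the figures in the text. The 1-Gflop-per-processor C90 peak is an assumption, inferred from the 8-Gflop result rather than stated here, and the function name is hypothetical.

```python
# Memory-bound speed estimate from computational intensity.

def sustained_mflops(peak_mflops, memory_maps, intensity):
    """Speed is capped by peak and by memory accesses / intensity."""
    return min(peak_mflops, memory_maps / intensity)


# CM5 node: four vector units at 16 Maps each, 128 Mflops peak.
cm5_maps = 4 * 16.0
cm5_balance = cm5_maps / 128.0                        # accesses per flop
cm5_rate = sustained_mflops(128.0, cm5_maps, 3.0)     # A = B + C

# C90: 1.5 Maps per Mflop; 16 processors.
c90_peak = 16 * 1000.0     # assumption: about 1 Gflop per processor
c90_rate = sustained_mflops(c90_peak, 1.5 * c90_peak, 3.0)

if __name__ == "__main__":
    print(f"CM5 node balance: {cm5_balance} accesses per flop")
    print(f"CM5 node on A = B + C: {cm5_rate:.1f} Mflops")
    print(f"C90 (16 processors) on A = B + C: {c90_rate:.0f} Mflops")
```

Under these inputs the CM5 node delivers about 21 Mflops of its 128-Mflop peak, and the 16-processor C90 about 8 Gflops, matching the figures quoted above.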

The switch has three parts: diag- 
nosis and reconfiguration; data 
message-passing; and control. The 
data network operates at 5MB/sec
full duplex. A number of control 
messages are possible, and all proc- 



essors use the network. Control 
network messages include broad- 
casting (e.g., sending a scalar or 
vector) to all nodes unless it ab- 
stains, results recombining (net- 
work carries out arithmetic and log- 
ical operations on data supplied by 
each node), and global signaling, 
synchronization for controlling 
parallel programs. The switch is 
wired into each cabinet that holds 
the 256 vector computers. 

A Score of Multicomputers. Cray 
Research has a DARPA contract to 
supply a machine capable of peak 
teraflop operation by 1995, and a 
sustained teraflop by 1997 using 
DEC Alpha microprocessors. Con- 
vex has announced it is working on 
a massively parallel computer, 
using HP's microprocessor as its 
base. IBM has several multicom- 
puter systems that it may produc- 
tize based on RS6000 workstations. 
Japanese manufacturers are build- 
ing multicomputers, for example, 
using a comparatively small num- 
ber (100s) of fast computers (i.e.,
1Gflop) interconnected via very
high-speed networks in a small 
space. 

A number of multicomputers 
have been built using the Inmos 
Transputer and Intel i860 (e.g., 
Transtech Parallel Systems). Mer- 
cury couples 32, 40MHz, i860s and 
rates the configuration at 2.5Gflops
for signal processing, simulation,
imaging, and seismic analysis.
Meiko's 62-node multicomputer
has a peak of 2.5Gflops and deliv-
ered 1.3Gflops for Linpack, or
approximately half its peak on an
O(8500) matrix [8]. Parsytec GC
consists of 64 to 16K nodes, delivering
a peak of 400Gflops. The nCUBE 2
system has up to 8K nodes with up 
to 64MB per node. 

Multicomputers are also built for 
specific tasks. IBM's Power Visual- 
izer uses several i860s to do visuali-
zation transformations and render- 
ing. AT&T DSP3 Parallel Processor 
provides up to 2.56Gflops, formed 
with 128 signal-processing nodes. 
The DSP3 is used for such tasks as 
signal- and image-processing, and 
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speech recognition. The DARPA- 
Intel iWarp™ developed with CMU 
is being used for a variety of signal- 
and image-processing applications 
that are typically connected to 
workstations. An iWarp node pro- 
vides only 10Mflops (or 20Mflops
for 32-bit precision) and 0.5 to 
16MB of memory per node. Each 
node can communicate at up to 
320MB per sec on 8 links. 

The Teradata/NCR/AT&T sys-
tems are used for database retrieval 
and transaction-processing in a 
commercial environment. The sys- 
tem is connected by a tree-struc- 
tured switch, and the hundreds of 
Intel 486 leaf-node processors
handle applications, communica- 
tions, or disk access. A system can 
process over 1,000 transactions per 
sec working with a single database, 
or roughly four times the perfor- 
mance of a multiprocessor main- 
frame. 

Programming Environments 
to Support Parallelism 

Although the spectacular increases 
in performance derived from mi- 
croprocessors are noteworthy, per- 
haps the greatest breakthrough has 
come in software environments 
such as Linda, the Parallel Virtual
Machine (PVM), and Parasoft's
Express that permit users to struc-
ture and control a collection of pro-
cesses to operate in parallel on
independent computers using mes-
sage passing. Of course, user inter-
face software, debuggers, perfor-
mance-monitoring, and many other 
tools are part of these basic parallel 
environments. 

Several programming models 
and environments are used to con- 
trol parallelism. For multiproces- 
sors, small degrees of parallelism 
are supported through such mech- 
anisms as multitasking and Unix™ 
pipes in an explicit or direct user 
control fashion. Linda extends this 
model to manage the creation and
distribution of independent pro-
cesses for parallel execution in a
shared address space. Medium (10
to 100) and high degrees of paral- 
lelism (1,000) for a single job can be 




carried out in either an explicit 
message-passing or implicit fash- 
ion. The most straightforward im- 
plicit method is the SPMD model 
for hosting Fortran across a num- 
ber of computers. A Fortran 90 
translator should enable multiple 
workstations to be used in parallel 
on a single program in a language 
evolutionary fashion. Furthermore, 
a program written in this fashion 
can be used equally effectively 
across a number of different envi- 
ronments from supercomputers to 
workstation networks. Alterna- 
tively, a new language having more 
inherent parallelism, such as 
dataflow, may evolve. Fortran will 
adopt it. 
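
The SPMD model described above can be sketched in a few lines. This is only an illustration, not Fortran 90 and not any of the environments named in the text: Python threads stand in for separate computers, and the names `block_bounds`, `worker`, and `spmd_sum` are hypothetical. Every worker runs the same program and uses only its rank to decide which block of the data it owns.

```python
from concurrent.futures import ThreadPoolExecutor


def block_bounds(n_items, n_workers, rank):
    """Return the [start, stop) slice owned by `rank` in a block partition."""
    base, extra = divmod(n_items, n_workers)
    start = rank * base + min(rank, extra)
    stop = start + base + (1 if rank < extra else 0)
    return start, stop


def worker(rank, n_workers, data):
    """The single program every worker runs; only `rank` differs."""
    start, stop = block_bounds(len(data), n_workers, rank)
    return sum(x * x for x in data[start:stop])


def spmd_sum(data, n_workers=4):
    """Run the same `worker` on every rank, then combine the partial sums."""
    with ThreadPoolExecutor(max_workers=n_workers) as pool:
        partials = pool.map(lambda r: worker(r, n_workers, data), range(n_workers))
        return sum(partials)


# Sum of squares of 0..9, computed by three workers on disjoint blocks.
total = spmd_sum(list(range(10)), n_workers=3)
```

On a real multicomputer the final combination step would be a message-passing reduction rather than a local `sum`, and the cost of that step is part of what the text means by overhead.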

Research Computers 

Much of university computer archi- 
tecture research is aimed at scal- 
able, shared-memory multiproces- 
sors (e.g., [20]) and supported by 
DARPA. In 1991, MITI sponsored 
the first conference on shared- 
memory multiprocessors in Tokyo, 
to increase international under- 
standing. It brought together re- 
search results from 10 universities 
(eight U.S., two Japanese), and four 
industrial labs (three U.S., one Jap- 
anese). This work includes direc- 
tory schemes to efficiently "track" 
cached data as it is moved among 
the distributed processor-memory 
pairs, performance analysis, inter- 
connection schemes, multithreaded 
processors, and compilers. 
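
As a rough illustration of what a directory scheme tracks, the sketch below keeps, for each memory block, the set of nodes that hold a cached copy. It is a toy model under stated assumptions: the class name `Directory`, its methods, and the invalidate-on-write policy are hypothetical choices for the example, not the protocol of any machine discussed here.

```python
class Directory:
    """Toy directory: for each memory block, remember which nodes cache it."""

    def __init__(self):
        self.sharers = {}  # block address -> set of node ids holding a copy

    def read(self, node, block):
        """A node fetches a copy; the directory records it as a sharer."""
        self.sharers.setdefault(block, set()).add(node)

    def write(self, node, block):
        """A node wants exclusive access; return the copies to invalidate."""
        stale = self.sharers.get(block, set()) - {node}
        self.sharers[block] = {node}
        return stale


d = Directory()
d.read(0, 0x40)   # node 0 caches block 0x40
d.read(3, 0x40)   # node 3 caches the same block
invalidated = d.write(3, 0x40)  # node 3 writes; node 0's copy becomes stale
```

The point of the directory is that a write only has to contact the nodes listed as sharers, rather than broadcasting to every processor-memory pair.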

Researchers at the University of 
California, Berkeley are using a 64- 
node CM5 to explore various pro- 
gramming models and languages 
including dataflow. Early work in- 
cludes a library to allow the com- 
puter to simulate a shared memory 
multiprocessor. An equally impor- 
tant part of Berkeley's research is 
the Sequoia 2000 project being 
done in collaboration with NASA 
and DEC that focuses on real-time 
data acquisition of 2 terabytes of 
data per day, secondary and terti- 
ary memories, and very large data- 
bases requiring multiple accesses. 

Seitz, at Cal Tech, developed the 
first multicomputer (intercon- 
nected via a hypercube network) 
and went on to develop high- 
bandwidth grid networks that use 
wormhole routing. The basic switch 
technology is being used in a variety 
of multiprocessors and multicom- 
puters including Intel and Cray. 
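
To make the hypercube interconnection concrete, here is a minimal sketch of dimension-order routing, a common textbook way to pick a path between two hypercube nodes. The text does not say which routing algorithm these machines used, so the choice of algorithm and the function name `ecube_route` are assumptions. The sketch only picks the path; it does not model wormhole routing, which concerns how a message is pipelined along that path.

```python
def ecube_route(src, dst, dims):
    """Return the node sequence from src to dst in a dims-dimensional hypercube.

    Node ids are dims-bit integers; neighbors differ in exactly one bit.
    The route corrects differing address bits from dimension 0 upward.
    """
    if not (0 <= src < 2 ** dims and 0 <= dst < 2 ** dims):
        raise ValueError("node id out of range for this hypercube")
    path = [src]
    node = src
    for d in range(dims):
        bit = 1 << d
        if (node ^ dst) & bit:   # addresses differ in dimension d
            node ^= bit          # hop to the neighbor across dimension d
            path.append(node)
    return path


# Route in an 8-node (3-dimensional) hypercube from node 000 to node 101.
route = ecube_route(0b000, 0b101, dims=3)
```

The number of hops equals the number of address bits in which source and destination differ, so no route in a `dims`-dimensional hypercube is longer than `dims` hops.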

The CEDAR project at the Uni- 
versity of Illinois is in the comple- 
tion phase, and scores of papers 
describe the design, measurements, 
and the problem of compiling for 
scalable multiprocessors. Unfortu- 
nately, CEDAR was built on the 
now defunct Alliant Multiproces- 
sor. 

MIT has continued researching 
multiple machine designs. The 
Monsoon dataflow computer be- 
came operational with 16, 5Mflop 
nodes and demonstrates scalability 
and implicit parallelism using a 
dataflow language. The next 
dataflow processor design, T*, is 
multithreaded to absorb network 
latency. The J-machine is a simple 
multicomputer designed for mes- 
sage-passing and low-overhead sys- 
tem calls. The J-machine, like the 
original RISC designs, places hard- 
ware functions in software to sim- 
plify the processor design. A 
J-machine, with sufficient software, 
carries out message-passing func- 
tions to enable shared-memory 
multiprocessing. Alewife, like Stan- 
ford's DASH, is a distributed multi- 
ple multithreaded processor that is 
interconnected via a grid switch. 
Additional efforts are aimed at 
switches and packaging, including a 
3D interconnection scheme. 

Rice University continues to lead 
compiler research in universities 
and was responsible for the HPF 
Forum. HPF is a successor to For- 
tran D (data parallelism) that was 
initially posited for all high-perfor- 
mance machines (SIMDs, vector 
multiprocessors, and multicomput- 
ers). The challenge in multicom- 
puters is initial data allocation, con- 
trol of memory copies, and 
avoiding latency and overhead 
when passing messages. 

Stanford's DASH is a scalable 
multiprocessor with up to 64 pro- 
cessors arranged in a grid of 4 × 4 
nodes. Each node consists of a four- 
processor Silicon Graphics multi- 
processor that is roughly equivalent 
to a uniprocessor with an attached 
vector unit. DASH demonstrated 
linear speedups for a wide range of 
algorithms, and is used for com- 
piler research. Some applications 
have reached over 100Mflops for 
16 processors, which is about the 
speed a four-vector processor sys- 
tem of comparable speed achieves. 
Since the system is relatively slow, it 
is unclear which principles have 
applicability for competitive com- 
puters. 

Role of Computer and 
Computational Science 

A recent panel of computer and computational scientists 
described their reservations about the progress in paral- 
lel computing, expressing concerns about training and 
people interested in computational science, machine availability, 
and lack of standards caused by too many programming models 
[19]. They compared progress to the difficulty in learning to use 
vector processors. A recent study by the IEEE Technical Commit- 
tee on Supercomputing (IEEE 1992) showed that out of approxi- 
mately 8,000 users of the NSF Supercomputer centers, less than 
100 computer science and about 200 electrical engineering users 
used a negligible amount of the resources. With today's parallel 
computers that are an artifact of massive federal funding, how- 
ever, computer science has been slowly attracted to helping un- 
derstand fundamentals. 

In 1987, as assistant director of computing at NSF, I urged the 
computer science community to become involved in parallelism 
by using and understanding the plethora of computers that can 
be applied to computational science [2]. This would entail under- 
standing applications, including solving problems using new paral- 
lel structures, writing texts, training students, and carrying out 
research. Today, much computer science research is devoted to 
some form of parallelism. Making significant contributions in par- 
allelism requires understanding and solving problems that are 
usually numerically intensive. Numerical analysis is not part of the 
computer science core curriculum. In a similar fashion, the results 
of supercomputing often require visualization (also not part of 
the curriculum). Visualization also changes the nature of I/O and 
mass storage. Now, I suggest the following: 

1. Collaborating with scientists and engineers on real problems 
using real computers. This enhances the training of computa- 
tional scientists who understand and enrich computer science. 

2. Training and understanding using traditional, uniprocessor 
supercomputers and shared-memory multiprocessors that pro- 
vide fine granularity. If code runs poorly on a super or a shared- 
memory multiprocessor, it is certain to run poorly on a distrib- 
uted multicomputer. 

3. Installing, teaching, and using environments composed of exist- 
ing workstations that have very long granularity. These can and 
must be dealt with using programming environments such as 
Linda or PVM. 

Attaching SIMDs and multicomputers to workstations for special- 
ized problems. 

4. Progressing to problems and algorithms that can tolerate the 
latencies inherent in multicomputers. 

5. Designing benchmarks and workloads typifying new programs 
and computers to enhance understanding. Collaborating on com- 
puters that are being designed, and making them run well is 
much more important than producing any more computers. 

We must thoroughly understand the machines we have by using 
and measuring them on a range of real problems. The goal should 
be to look at a problem / program / algorithm and know a priori 
how an application will run, based on the computer / compiler as 
measured by its various parameters. Making a parallel application 
run effectively is an ad hoc art. ■ 

DARPA has funded Tera Com- 
puter to start up. Tera, a second- 
generation HEP, is to have 256 128- 
instruction stream processors or 
32K processors and operate in 
1995. With a multiple instruction 
stream or multithreaded processor, 
any time a processor has to wait for 
a memory access, the next instruc- 
tion stream is started. Each proces- 
sor is built using very fast gate ar- 
rays (e.g., GaAs) to operate at 
400MHz. The expected latency to 
memory is between 40 and 100 
ticks, but since each processor can 
issue multiple requests, a single 
physical processor appears to sup- 
port 16 threads (or virtual proces- 
sors). Thus, a processor appears to 
have access to a constant, zero la- 
tency memory. Since a processor is 
time-shared, it is comparatively 
slow and likely to be unusable for 
scalar tasks, and is hardly a general- 
purpose computer according to 
Smith's definition [21]. The physi- 
cal processors connect to 512 mem- 
ory units, 256 I/O cache units (i.e., 
slower memories used for buffer- 
ing I/O), and I/O processors 
through a 4K-node interconnection 
network. In order to avoid the 
"dance hall" label, the network has 
four times more nodes than are 
required by the components. Tera's 
approach has been to design a com- 
puter that supports a parallelizing 
compiler. 
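
The arithmetic behind the Tera figures can be checked with a short sketch. The stream counts and the 40 to 100 tick latency come from the paragraph above; the model itself is an assumption made for illustration: it takes the worst case in which every instruction is a memory reference, so a stream issues for one tick and then waits out the full latency. The function names are hypothetical.

```python
def total_streams(processors=256, streams_per_processor=128):
    """Instruction streams across the whole machine (figures from the text)."""
    return processors * streams_per_processor


def streams_to_hide(latency_ticks):
    """Streams needed to keep one processor busy in the worst-case toy model.

    Assumption: every instruction is a memory reference, so a stream issues
    for one tick and then waits `latency_ticks` before it can issue again.
    """
    return latency_ticks + 1


def utilization(ready_streams, latency_ticks):
    """Fraction of ticks in which the processor has a stream ready to issue."""
    return min(1.0, ready_streams / streams_to_hide(latency_ticks))


machine_streams = total_streams()                  # the "32K processors"
needed_low = streams_to_hide(40)                   # shortest quoted latency
needed_high = streams_to_hide(100)                 # longest quoted latency
busy_at_worst_latency = utilization(128, 100)      # 128 streams per processor
```

Under these assumptions 101 ready streams cover the 100-tick case, which is within the 128 streams per processor quoted above. The same model shows the cost: a program that supplies only one stream sees the full memory latency on every reference, which is the scalar weakness the paragraph points out.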

Summary and Conclusions 

In 1989 I projected that supercom- 
puters would reach a teraflop by 
1995 using either a SIMD or multi- 
computer approach, but neglected 
to mention the price. By mid-1992, 
scalable multicomputers have im- 
proved by a factor of 5 to 10 to 
reach 100 ± 20 Gflops at a super- 
computer price level, and the SIMD 
approach was dropped. Scalable 
multicomputers also break through 
the $30 million supercomputer bar- 
rier to create a teraflop ultracom- 
puter ($50 to 400 million), and are 
not recommended buys. In 1995 
semiconductor gains should in- 
crease performance by a factor of 
four and competition should re- 
duce the price by a factor of two to 
supercomputer levels as projected 
in Figure 1. Given the number of 
applications and state of training, 
waiting for the teraflop at the 
supercomputer price level is rec- 
ommended. 

Multiprocessors have been the 
mainline of computing because 
they are general-purpose and can 
handle a variety of applications 
from transaction-processing to 
time-sharing, that are highly se- 
quential to highly parallel. KSR 
demonstrated that massive, scalable 



Government Policy: Why We Don't 
Need "State" Computer Architectures 



In January 1992, the Presi- 
dent signed a law author- 
izing the spending of $1 
billion for various agencies to 
comply with the HPCC Act. The 
law provides for building a Na- 
tional Research and Education 
Network (NREN), as well as 
work on parallel computing, 
algorithms, and computer sci- 
ence education. The HPCC Re- 
port (OSTP, 1992) outlines the 
role of the various agencies 
(DARPA, DOC, DOE, EPA, NIH/ 
NLM, and NSF) in computing 
systems, software technology 
and algorithms, the network, 
and basic research and human 
resources. The report outlines 
a variety of grand challenges in 
science and engineering, rang- 
ing from weather and climate 
prediction and global change, 
to astronomy, semiconductors, 
superconductors, speech, and 
vision. According to the re- 
port's budget, DARPA has two- 
thirds of the budget, or over 
$100 million in 1992 for high- 
performance computing sys- 
tems; other agencies have the 
grand challenge problems. 
Undoubtedly, the most impor- 
tant aspect of the program will 
be training and the network. 
So far, the architectures and 
companies resulting from this 
massive funding have been less 
than spectacular, which con- 
firms my opinion that DARPA 
should not directly fund the 
development of computers at 
companies. 



DARPA has a long and suc- 
cessful record of sponsoring 
university research that creates 
companies and products such 
as MIPS, Sun, and Sparc in cases 
where no products or technol- 
ogy existed. It fostered AI, 
graphics, operating systems, 
packet switching, speech un- 
derstanding, time-sharing, VLSI 
design, and workstations at 
universities. Supercomputing, 
including using a massive 
(>1000) number of processors, 
is a commercial area that has 
been developed by industry 
and does not require the selec- 
tion or support of particular 
architectures or companies. 
DARPA's role in the develop- 
ment of massive parallelism can 
be terminated because it has 
been picked up by industry. 
Almost a dozen companies are 
building multicomputers that 
compete with DARPA's incestu- 
ous product divisions (Cray, 
Intel, Tera, and Thinking Ma- 
chines). The situation of fund- 
ing the design and purchase of 
a computer is not a healthy 
one. 

I know of no successful prod- 
ucts developed by funding 
company product develop- 
ment, including the vast array 
of military computers. DARPA 
funded Burroughs to build the 
unsuccessful Illiac IV in the late 
1960s, a 64-processor SIMD. 
ARPA funded BBN to provide 
the first switching computers 
for ARPAnet. BBN was success- 
ful for almost a decade during 
which it had a technology 
monopoly as the government 
paid for product development 
and bought and tested its 
products. In 1992, BBN is a 
minor supplier in a flourishing 
communications market. Simi- 
larly, BBN's government-funded 
computer development that 
was initiated by DARPA folded 
in 1991. The several hundred 
million dollars of funds that 
went into a couple of massively 
parallel computers this last 
decade could have been used 
to provide substantially more 
computing power to real users. 
By way of contrast, NSF spends 
about $60 million annually to 
support four supercomputing 
centers. Funding to create a 
monopoly company only inhib- 
its the development of tech- 
nology and products at other 
companies. Product develop- 
ment contracts and purchase 
orders have not created lasting 
companies, and are not likely 
to in the future. Companies 
funded in an incestuous fash- 
ion simply cannot stand up to a 
"real" market, and will not be 
able to compete internation- 
ally. A large fraction of the 
market is reduced or elimi- 
nated when government funds 
and buys its own designs, clos- 
ing the early adopter and inno- 
vator government markets to 
privately financed computer 
companies. Furthermore, once 
started, DARPA-funded compa- 
nies require continued funding 
to remain healthy . . . just as 
the defense contractors that 
are being downsized. 

distributed, shared-memory multi- 
processors (smPs) are feasible. Mul- 
ticomputers from the score of com- 
panies combining computers will 
evolve to multiprocessors just to 
reduce overhead in simulating a 
single-memory address space, 
memory access, and supporting ef- 
ficient multiprogramming. Next- 
generation multicomputers are 
likely to resemble BBN's distrib- 
uted memory computers as they 
evolve to become multiprocessors. 

Important gains in parallelism 
have come from software environ- 
ments that allow networks of com- 
puters to be applied to a single job. 
Thus every laboratory containing 
fast workstations has or will have its 
own supercomputer for highly par- 
allel applications. The rapid in- 
crease in microprocessor power 
ensures that the workstation will 
perform at near super speed for 
sequential applications. LAN envi- 
ronments can provide significant 
supercomputing for highly parallel 
applications by 1995. It is critical 
for companies to provide fast, low- 
overhead switches that will allow 
users to configure multicomputers 
composed of less than 100 high- 



ly taking over computer de- 
sign through funding and pur- 
chasing, users do not bench- 
mark or understand the 
machines they are given or 
forced to use. At a time when 
gigabucks for teraflops may 
induce brain damage, it is criti- 
cal to consider all the factors 
of an architecture and espe- 
cially the mean time before 
answers. In the future, govern- 
ment laboratories are likely to 
be measured on their ability to 
replicate and transfer results, 
and having programs that run 
well across a variety of ma- 
chines is more important than 
exploiting the latest fad. When 
government support for the 
HPCC program ends, it is the 
market, rather than a bureau- 
crat's dream of an architecture 
or industry, that determines 
economics. 

At the Sid Fernbach Memorial 
Symposium, February 1992, 
speakers reiterated the policy 
that Fernbach used, as head of 
computation at Lawrence Liver- 
more National Laboratory, to 
help supercomputing come 
into existence: be a knowl- 
edgeable, demanding, tolerant, 
and helpful customer. Depart- 
ment of Energy laboratories 
purchased, not funded, the de- 
velopment of early computers 
by providing needs specifica- 
tions. 

Following is a suggested policy 
to support development of 
high-performance computing: 

1. The concept of the 
ultracomputer is so artificial 
and deleterious to the com- 
puter industry that buying a 
single Ultra should be discour- 
aged. An affordable teraflop 
will come by 1995. Let evolu- 
tion work to produce better 
computers that are balanced 
and usable, not aimed at a 
single, peak number. 

2. Support users to purchase 
machines that can be justified 
for specific programs at various 
agencies and organizations that 
have "grand challenge" prob- 
lems. Contracts would be let to 
bid, and benchmarks that char- 
acterize the user workloads 
would be required. The es- 
tablishment of benchmarking 
would cause reality to replace 
hope as a buying criterion. 

Allow universities to choose 
the computers they buy. Don't 
control the design of computers 
from Washington (funding agency, 
Congress). Given the specialized 
nature and high cost of a 
supercomputer or ultracom- 
puter, users (e.g., weapons de- 
signers) who can justify them 
from tool and experiment bud- 
gets should simply buy them. 

3. Encourage collaboration. Any 
company should be free to 
work with a laboratory project 
to produce technology, proto- 
type, or product. Fund univer- 
sity projects (not their 
codevelopers) where a 
codeveloping company is capa- 
ble of or likely to be able to 
take the product to market. 

Encourage laboratories to 
obtain clustered computers 
based on workstation nodes 
providing more than an order 
of magnitude more peak power 
than supers. Such machines 
would provide the same power 
as scalable multicomputers, but 
in addition, dedicated power 
for visualization and video con- 
ferencing. 

4. Do not fund computer de- 
velopment per se. Industry has 
always been able to fund good 
ideas. There is really no effec- 
tive way to select the "right" 
winner from [...]. Once 
started, funding becomes an 
ongoing government responsi- 
bility and right. For startups: 
get the money and do it. In the 
case of large companies: if the 
technology is worth funding, 
fund it. A company only takes 
government money to build a 
computer if the project is not 
worthwhile, and they are sup- 
porting its staff. 

5. Encourage the use of com- 
puters in universities vs. de- 
signing more computers by 
people who have never used or 
built a computer. The world is 
drowning in computers that 
absorb programmers trying to 
realize peak performance. 

6. Eliminate funding by con- 
gressionally directed centers 
even though this might work. 
Choose centers based on com- 
petence. 

7. In the very unlikely event 
that no one is building the 
appropriate computer and a 
special one must be funded for 
clear military need, build a few 
prototypes (≥2), open the pro- 
cess to all bidders without the 
usual military procurement 
hassles. ■ 

performance workstations, because 
these are likely to provide the most 
cost-effective and useful computing 
environments. 

A petaflop (10^15 floating-point 
operations per sec) at less than the 
$500 million level is likely to be 
beyond 2001. By 2001, semiconductor 
gains provide an increase factor of 
16 over 1995 computers. Better 
packaging and lower price margins 
through competition could provide 
another increase factor of two or 
three. The extra increase factor of 
20 for a petaflop is unclear. Based 
on today's results and rationales, a 
petaflop before its time is as bad an 
investment as the teraflop before its 
time. Evolution and market forces 
are just fine ... if we will just let 
them work. 
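
The petaflop estimate above can be reproduced with a few lines of arithmetic. This is one reading of the paragraph, not a model taken from the text: it assumes the 1995 baseline is the one-teraflop machine projected earlier in this summary, and the names `projected_2001` and `shortfall` are hypothetical.

```python
TERAFLOP = 1e12   # floating-point operations per second
PETAFLOP = 1e15


def projected_2001(base_1995=TERAFLOP, semiconductor_gain=16, packaging_gain=3):
    """Performance in 2001 if both gains multiply an assumed 1995 baseline."""
    return base_1995 * semiconductor_gain * packaging_gain


def shortfall(target, projected):
    """Remaining factor needed to reach `target` from `projected`."""
    return target / projected


# Packaging and price-margin gains are quoted as "two or three".
optimistic = projected_2001(packaging_gain=3)    # 48 teraflops
pessimistic = projected_2001(packaging_gain=2)   # 32 teraflops

gap_optimistic = shortfall(PETAFLOP, optimistic)
gap_pessimistic = shortfall(PETAFLOP, pessimistic)
```

With the optimistic packaging gain the remaining gap is a little under 21, which matches the "extra increase factor of 20" in the text; with the pessimistic gain it is closer to 31.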